PREFACE


Elementary Statistics and Applications is designed for a begin- 
ning course. Principles of gathering and presenting statistics, 
frequency-distribution analysis, probability theory and the 
normal curve, correlation, time-series analysis, and forecasting 
are included. Elementary sampling procedure, only so far as 
it is founded upon the assumption of normal sampling dis- 
tributions, is also included. 

No attempt has been made to include any of the less con- 
ventional methods of time-series analysis. Some are too mathe- 
matical for treatment in an elementary text. Others are so 
highly specialized or so subjective as to be unsuited for textbook 
material. Many of these are new methods that need to be 
further systematized, coordinated, and tested in the crucible of 
time and experience. 

The approach in this book is that of the teacher. The authors 
have been associated in teaching statistics for more than ten 
years. The manuscript of the present text evolved during those 
years in mimeographed form, modified from year to year as new 
theories developed and as teaching use required. The sug- 
gestions of students, whether consciously or unconsciously made, 
have helped formulate this book. Experience has shown that
students gain a sense of the close association of statistics to
reality from the brief discussions of the historical origin of impor-
tant steps in the development of statistical theory that are
included. 

The descriptions of frequency-distribution, correlation, and 
time-series analysis are first completed in their simplest aspects, 
with elementary illustrations. This enables the student to 
visualize basic method unmixed with the more advanced phases. 
More complex illustrations of practical application are then given 
in separate chapters or separate sections. This practice elimi- 
nates the apparent digression that seems to hamper the student 
when exposition of method and complicated illustrations are
intermixed, as in the conventional text. In separating the two,
moreover, the fact is recognized that the best order of presenta- 
tion for teaching is not the best order of procedure for working 
an actual problem. For example, the handiest method for 
making a frequency-distribution analysis is to set up a work sheet 
with a Charlier check and first calculate the moments or the k 
statistics; but the theories of the moments and of the k statistics 
are among the most difficult parts of the analysis to explain and 
are not therefore good introductory topics for the teaching of 
frequency-distribution analysis. In addition, the practical 
analysis of the frequency distribution introduces short cuts, cross 
checks, or other timesaving devices. The authors believe that 
this new arrangement will also prove to be a boon to research 
workers who may use the text as a reference book. 

The more advanced points of statistical theory pertaining to 
frequency curves and sampling analysis have been placed in a 
separate book entitled Sampling Statistics and Applications. The 
two books together constitute a set on the subject of Funda-
mentals of the Theory of Statistics.

In both volumes, the authors have drawn freely upon the 
many monographs and the periodical literature that have 
appeared during recent years. Care has been exercised to 
make acknowledgment in footnotes to the sources of new ideas 
that have been incorporated into the authors' own development 
of the subject. To all these vigorous workers in the field, too 
numerous to be listed by name, the authors as well as other
statisticians are greatly indebted. 

More particularly the authors here acknowledge a debt of
gratitude to several generous professional colleagues who have 
read parts of the manuscript with critical and judicious eye. 
Sidney W. Wilcox, Chief Statistician of the Bureau of Labor 
Statistics in the United States Department of Labor, made 
especially helpful suggestions for Chap. XIX, Index Numbers, 
for the chapters on probability theory, and for Parts I and II 
of Elementary Statistics and Applications. John H. Smith, of
the Bureau of Labor Statistics, contributed many stimulating
criticisms and suggestions that the authors believe inspired
important improvements. Lester S. Kellogg, Bureau of Labor
Statistics, read Chap. III, Sources of Statistics, and made sug-
gestions that led to a constructive reworking of that material.
The authors are profoundly grateful for such generous assistance
and wish to make full acknowledgment of their professional
indebtedness to these men.

The authors are grateful to the International Finance Section
of Princeton University for the financial assistance given Acheson
J. Duncan some years ago to enable him to study statistics and
mathematical economics with the late Henry Schultz of the
University of Chicago and with Harold Hotelling of Columbia 
University. The authors are indebted to those men, and to 
colleagues in the Mathematics Department at Princeton Uni-
versity. The authors are also indebted to Professor R. A.
Fisher, also to Messrs. Oliver & Boyd, Ltd., of Edinburgh, for 
permission to reprint an abridged edition of Table III, Table 
of from their book Statistical Methods for Research Workers. 

Naturally, it is not to be supposed that the whole or any part 
of the manuscript carries the endorsement of the authors' former 
teachers or those who have helped with criticisms of the manu- 
script. The authors assume full responsibility for errors of 
theory or calculation that may be present in the volumes. 

James G. Smith. 

Acheson J. Duncan. 

Princeton, N. J.,

August, 1944.
PART I
Introduction 


CHAPTER I 

STATISTICS IN THE ARTS AND SCIENCES 

From the sixteenth century to the present day, modern sciences 
have stressed empirical method — the gathering of data by labora- 
tory experiment or by statistical observation. Laboratory 
experimentation has been more spectacularly employed in the 
natural sciences (biology, chemistry, physics, botany, and the 
like), and statistical observation has been more widely used in 
the social sciences (such as politics, economics, and psychology). 
Yet laboratory technique is used for some types of investigation 
in the social sciences, especially in psychology, education, and 
agriculture; and statistical technique is frequently employed 
in the natural sciences; for example, the modern kinetic theory 
of gases is a statistical argument. 

Economy and Flexibility of Statistics. Meaning of Statistics.
Statistics and scientific method are of value wherever a mass of 
complicated facts exists and wherever those facts are amenable to 
quantitative expression. Qualitative knowledge must be con- 
verted into quantitative units of enumeration or of measurement 
before it becomes statistics. The quantitative units are either 
enumerative or measurement units. An enumerative unit 
depends upon proper definition of the objects to be counted; 
thus statistics may be compiled on the number of blue-eyed 
as compared with brown-eyed people, the number of yellow as 
compared with green beans, etc. A measurement unit depends 
upon contrivance of some unit of measurement for the purpose 
of converting qualitative knowledge into quantitative expression;
thus properly devised tests make it possible to measure intel-
lectual aptitude on a scale so that certain quantity figures can
be depended upon to measure relative amounts of intellectual
aptitude. 

Such quantitative description of facts makes it possible to 
give in a brief space a great amount of information. In order
to accomplish this economy of time and space, however, it is 
of the greatest importance that the units of measurement or of 
enumeration be uniformly applied and that the nature of these 
units of measurement for observation or of enumeration be con- 
stantly kept in mind when the data are used. Furthermore, 
having always in mind the nature of the statistical units chosen 
as criteria of measurement, it is possible to arrange statistical 
data in such a manner as greatly to facilitate their interpretation. 

A large degree of flexibility is thus available when facts are 
expressed quantitatively; and, so long as the original units of
measurement are not obscured, it is possible for specific purposes 
to arrange and rearrange a given set of data. A part of this 
flexibility is due to the fact that otherwise long, time-consum- 
ing methods of analysis can be resolved into relatively simple 
mathematical operations. These short cuts and the savings of 
human effort they make possible in the search for truth are only 
possible where knowledge can be expressed quantitatively, 
which is to say, by statistics. In using these short-cut methods, 
however, it is necessary to be ever watchful for hidden incon- 
sistencies with the original units of measurement, for it is in this 
realm that many of the misuses of statistics are found. 

Economy and a high degree of flexibility are characteristics of 
statistics that well fit them to serve a dynamic society's needs for
analysis and formulation of policy. It is a lesson learned from sad 
but profitable experience that statistics are something more than 
the mere will to collect facts in quantitative form. Careful study 
by many scholars has given rise to rules of procedure that must be 
followed if the economy and flexibility of which statistics are 
capable are to be realized. These rules of procedure constitute 
the science of statistics, to several aspects of which attention 
should be directed for differentiation. "Statistics" is used
broadly to refer to the whole field of the quantitative approach to
knowledge, including the gathering of data, problems of statistical
measurement, statistical analysis, statistical theory, and scientific
method in general. The word "statistics" is also used to refer to
any one of these parts of the whole subject. 





Accordingly, while "statistics" is used in the broad sense
indicated, it applies also more particularly and more accurately
to compiled data that are systematic and quantitative expressions
of facts or events. 

The theory of statistics is also called "statistics." The theory 
of statistics is the body of principles that has been developed, 
partly a priori by the mathematical approach and partly by 
empirical methods, to serve as a guide for sound statistics and 
sound statistical method. Understanding of the theory of sta- 
tistics is required also for compiling statistics. Statistical 
theory is required because nearly all compilations of quantitative 
facts are samples and not complete enumerations and because the 
fundamental rules regarding units of measurement must be 
obeyed in statistical enumeration if the resulting data are to be 
homogeneous, that is, comparable one with another. 

"Statistics" also refers to statistical method, a term used to
describe the process of interpreting facts by the use of statistics 
and statistical theory. Careful study of the assembled sta- 
tistical data, obtained in a manner to secure internal compara- 
bility and arranged in well-planned tables, may be used as a 
basis for judgments or action. Further quantitative treatment, 
however, may frequently give greater significance to the sta- 
tistics. Selected summaries may bring out many relationships 
that would be difficult to visualize if they were in tables of figures 
that had been compiled for general purposes. This additional 
quantitative treatment is of the nature of classification and 
summarization. It is called "statistical analysis" and includes 
the methods of tabulation, graphs, averages, measures of varia-
bility, correlation, index numbers, and similar quantitative
analyses that have been developed. Judgments based on 
statistical analysis are called "statistical inferences." Sta- 
tistical method, then, consists of two parts, (1) statistical analysis 
and (2) statistical inferences. 

In recent years the word "statistics" has also been used to
describe figures that have been obtained by statistical analysis; 
for example, arithmetic means, average deviations, measures of 
correlation, and the like, are all called "statistics," and any one 
of them alone is called a "statistic." 

The word "statistics" is thus used to mean all these various
things together and any one of them separately. This may make
for confusion, and in the above discussion such usage makes it
appear as if terms were defined by use of the term defined; but
such is established conventional, albeit confused, use of this all-
inclusive word "statistics."

THREE TYPES OF DATA 

Empirical vs. Experimental Data. Answering the accusation
that their conclusions are so vague and unpredictable as to pre- 
clude scientific sanction, the social scientists have often pleaded 
that social studies cannot, like the theories of the natural sciences, 
be tested in the laboratory. The social sciences must rely only 
on statistics and empirical or historical methods. Social theories 
can be interpreted with respect to true life only if viewed in the 
light of a ceteris paribus assumption. The assumption that 
other things are equal, or unchanged, or in balance serves the 
social scientist in the same manner as controls over experimental 
conditions serve the natural scientists. 

With the development, on the one hand, of statistical methods 
in the natural sciences and the development, on the other hand, of 
experimental methods in the social sciences, this contrast is 
becoming less real. While it is still true that social science 
predominantly uses empirical or historical data, some important 
work has been done, and more important work appears in the 
offing, with experimental data in the fields of psychology, 
sociology, education, medicine, population studies, agricultural 
economics, and statistical control of quality of manufactured 
products. Such outstanding progress in the technical develop-
ment of this experimental work has been made as to constitute
almost a special field called "design of experiments."¹

Design of Experiments. The arrangements for making the 
experiment and for recording the data therefrom constitute the 
design of the experiment. In designing an experiment, methods 
of so controlling the experiment as to prevent biased results must
be devised. If, for example, the experiment is to test the effects 
upon cotton culture of a certain kind of fertilizer, several areas in 
various localities may be chosen in order to test the effect of 
the selected fertilizer under a number of climatic conditions. 
The design for the experiment must then plan also some means of 
measuring these various other influences, namely, temperature 

¹ Fisher, R. A., The Design of Experiments (1935).

and rainfall. Some method must also be devised for discovering,
in the resulting data, not only how much of the productivity of
cotton is due to the fertilizer, but also how much is due to the
differing qualities of the soil, to varying amounts of rainfall, and to 
varying levels of temperature. The design for the experiment 
must plan and organize the procedure so that from the resulting 
data it will in truth be possible to measure the net influence of the 
new fertilizer. 

Where cost is a consideration, and it seldom is not, an impor- 
tant part of the design of experiment is to decide to what extent 
to experiment, in other words, how small an experiment will give 
trustworthy results. Before doing this it must be decided how 
much precision in the results, for practical purposes, is required. 
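
As a rough illustration of how the required precision governs the size of an experiment, the following sketch assumes that the sample mean is approximately normally distributed and that the standard deviation of a single observation is known in advance. Both are assumptions made here for illustration only, as are the function name and the figures used.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma, margin, confidence=0.95):
    """Smallest n for which the half-width of the normal-theory
    confidence interval for a mean does not exceed `margin`.

    sigma  -- assumed standard deviation of one observation
    margin -- largest acceptable error in the estimated mean
    """
    # Two-sided critical value of the standard normal curve.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil((z * sigma / margin) ** 2)


# Hypothetical figures: yield per plot varies with a standard
# deviation of 40 lb.; compare a coarse and a fine precision target.
n_coarse = required_sample_size(sigma=40.0, margin=20.0)
n_fine = required_sample_size(sigma=40.0, margin=10.0)
```

Under these assumptions, halving the permissible error roughly quadruples the number of observations needed, which is why the degree of precision must be settled before the size, and hence the cost, of the experiment can be fixed.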

The solution of some of the problems relating to design of 
experiment may be found by applying the theory of statistics. 
The solution to others is a matter of common sense, which some- 
times is more difficult to apply than might be supposed. 

Not only in such a case as testing the use of fertilizer, but in 
many problems, the researcher finds that a number of factors 
influence a given result. In agricultural phenomena, weather, 
climate, and other natural and human factors are present; in
medicine, age, sex, and other conditions affect the application of 
treatment; in biochemical and in psychological experimentation, 
many human and natural variables enter. When it is necessary 
for a given purpose of analysis to isolate one of several influences, 
the data can be so selected or the treatments so applied as to hold 
other influences constant. For example, if age and sex as well as 
inoculation affect the outcome of pneumonia cases, the inocula- 
tion can be tested by comparing inoculated and noninoculated for 
those of the same sex or age group. It has become the practice 
to call the noninoculated group the "control" in the experiment.¹
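
The comparison of inoculated and noninoculated cases within the same age and sex group can be sketched as follows. The records, the group labels, and the function name are hypothetical and serve only to show the arrangement of the calculation; no actual medical data are represented.

```python
from collections import defaultdict

# Hypothetical records: (age_group, sex, inoculated, recovered)
cases = [
    ("under 40", "M", True, True),
    ("under 40", "M", True, True),
    ("under 40", "M", False, True),
    ("under 40", "M", False, False),
    ("40 and over", "F", True, True),
    ("40 and over", "F", True, False),
    ("40 and over", "F", False, False),
    ("40 and over", "F", False, False),
    ("40 and over", "M", True, True),  # no control case in this group
]


def recovery_rates_by_group(records):
    """Return {(age_group, sex): (inoculated_rate, control_rate)}.

    Groups lacking either inoculated or control cases are omitted,
    since no like-with-like comparison is possible for them.
    """
    # counts[group][inoculated] = [recovered, total]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for age, sex, inoculated, recovered in records:
        cell = counts[(age, sex)][inoculated]
        cell[0] += int(recovered)
        cell[1] += 1

    rates = {}
    for group, by_treatment in counts.items():
        treated, control = by_treatment[True], by_treatment[False]
        if treated[1] and control[1]:
            rates[group] = (treated[0] / treated[1],
                            control[0] / control[1])
    return rates


rates = recovery_rates_by_group(cases)
```

Because each pair of rates is computed within a single age and sex group, differences between the inoculated cases and the control cannot be attributed to age or sex.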

Hypothetico-observational Data. In addition to empirical and
experimental data scientists make extensive use of a third type,
namely, hypothetico-observational data.² For example, in the
physical sciences, that the moon is about 240,000 miles from the 
earth is a hypothetico-observational datum — no one has carried 

¹ Cf. Hill, A. Bradford, Principles of Medical Statistics (1939), pp. 4-8
and 170-178.

² Eddington, Sir Arthur, The Philosophy of Physical Science (1939),
pp. 12-14.





out the experiment of measuring the distance from the earth to 
the moon; yet on the basis of certain hypotheses it is measured to
a comparatively high degree of precision. In the social sciences, 
index numbers purporting to measure such items as the general 
level of prices are hypothetico-observational data. In both 
illustrations, upon the basis of certain hypotheses or theories, 
practical methods are devised for estimating the measurement in 
question. In appraising the resulting estimate, the precision of 
the underlying theory or hypothesis is of primary importance.

SERVING THE ARTS AND SCIENCES 

Statistics and the Social Sciences and Arts. Politics. Public 
opinion, the opinion of the masses, can be ascertained at any time 
on a wide variety of social and political issues by means of 
statistical data collected by random questionnaires from a com- 
paratively small number of people. The employment of sta- 
tistical technique for this purpose has stirred the imagination 
and stimulated the ingenuity of students of the social and 
governmental processes. The widespread demand for such 
information and the relatively low cost of obtaining it by the 
sampling method have also gratified the acquisitiveness and 
lined the purses of a number of enterprising polling agents. 
Increasingly, political strategists appear to pay attention to these 
systematic statistical studies of public opinion. Both the major 
political parties in the United States have had expert statisticians 
engaged during the quadrennial presidential campaigns to keep 
their fingers on the pulse of public opinion. 

It has been claimed that "sampling referenda make the mass
articulate, define the mandate of our leaders, reveal the true
popular strength of pressure groups, and show social taboos
quantitatively for what they are worth, . . ." that they are, in
the language of journalism, "the fourth dimension for the Fourth
Estate."¹

Governmental Administration. Statistics are extensively used
as guides to various kinds of governmental administration, such 
as sanitation, hospitalization, highway supervision, and public 
industrial accident and compensation insurance laws. For exam- 
ple, on the assumption that industrial accidents are due to unsafe 

¹ Gallup, George, "Government and Sampling Referendum," Journal
of the American Statistical Association, Vol. 33 (1938), pp. 131-142.

conditions and unsafe practices that, if eliminated, would
prevent repetition of the same or similar types of accident,
statistical data on causes of accident have been assembled.
Study of these data enables the statistician to identify and select
the unsafe elements in transportation conditions and then to
present the data to safety engineers for guidance in accident
prevention.¹

From its beginning in 1790 to the present day, the Federal 
government has considered statistics on foreign trade so impor- 
tant that an organization has been maintained for the express 
purpose of assembling such statistics. In the early years of the 
republic they were gathered by the Treasury Department, but 
now they are collected and published regularly by the Depart- 
ment of Commerce. With the rapid development of large-scale 
business organization in the latter half of the nineteenth century, 
public policy with respect to social and economic conditions has
required the Federal government to maintain a Bureau of Labor 
Statistics which has been engaged principally in the task of col- 
lecting and publishing statistics on prices, cost of living, and 
wages and, in more recent years, on employment and pay rolls in 
manufacturing industries of the United States. 

It is a matter of common knowledge to all who read newspapers
that important laws are passed by city, state, and Federal gov- 
ernments on the basis of statistical facts assembled regularly 
or collected by special legislative committees. For example, the 
Federal Reserve System of banks in the United States was created 
in 1913 after a thorough study, involving extensive use of sta- 
tistics, of the banking situation in this and other countries; 
legislation in the decade of the 1930's on public works, relief work, 
and social security was largely based on studies of a statistical 
nature. 

Business Administration. Statistics are valuable in business
administration, enabling the manufacturing executive to obtain
more or less satisfactory answers to such a perplexing question as: 
Making allowance for seasonal changes and expected prices of 
substitute goods, what will be consumer demand for the coming 
year? Some manufacturers must make estimates for a year in 
advance; others can proceed successfully with monthly estimates. 

¹ Kossoris, M. D., "A Statistical Approach to Accident Prevention,"
Journal of the American Statistical Association, Vol. 34 (1939), p. 626.

Retail-store executives frequently require weekly or even daily
estimates on some articles, while sellers of perishable vegetables or 
other foods may even have to make hourly estimates. 

When the manufacturing executive has a satisfactory answer
to the above type of question, he can schedule production to 
maintain as nearly level a rate as is feasible and to keep as 
constant a labor force as possible. In some large business 
enterprises statistics are assembled daily on working capital 
position, factory expense, output, and consumer credit extended. 
Control by the executive is kept flexible and timely by a con- 
tinuous stream of statistics both on the internal state of the 
business and on external economic conditions. As one rather 
erudite businessman says, "There has been an insistence from the
very top of the organization on getting the facts, so that we might,
to apply Descartes's picturesque phrase, 'be clear about our
actions and walk surefootedly in this life.'"¹

In his determination of policies regarding prices, production, 
and employment in his own business, the enterpriser must 
make judgments based upon knowledge of the world of prices in 
which he lives. Prices he must pay for raw materials, for labor,
for equipment and its upkeep are his guide for determining his 
own production activity and the price he can eventually obtain 
for his product. Since all or at least part of the system of prices, 
that is, the prices he pays and the prices consumers pay for 
competitive or substitute articles, is beyond his control, the
individual producer adapts his plans to any uncontrollable condi-
tions he finds in the market. It is by the use of statistics that the
modern businessman comes to understand conditions to which, if 
he is to profit, he must succeed in adapting his own business. 

During recent years polling agencies have been hired by busi- 
ness executives to obtain certain types of information with 
respect to potential markets and changes in consumer tastes or 
habits. Student groups and student publications on the cam- 
puses of colleges and universities are employed by businessmen to 
make widespread use of polling techniques. It has also been
found that a carefully conducted student poll can do more 
to make college administrators and trustees cognizant of student 
attitudes toward vital campus issues than the older and less 

¹ Hayford, F. L., "Some Uses of Statistics in Executive Control,"
Journal of the American Statistical Association, Vol. 31 (1936), pp. 31-37.

effective means of circulating petitions. In the large university
the student poll performs many of the functions of the open
forum in a small university or college. Similarly, merchants and
classes in advertising can determine the efficacy of advertising
by the extent to which students express a preference for branded 
and highly advertised cigarettes, toilet articles, school supplies, 
and items of clothing to the little or nonadvertised varieties. 
The radio programs to which students listen, the magazines to 
which they subscribe, the amounts they spend for various 
budgetary items, the type of motion picture they most enjoy, 
the mileage they travel, and the means of transportation they 
prefer are typical items of information eagerly sought by adver- 
tising organizations and business firms in college and other kinds 
of markets as well.¹

In a wide variety of practical ways the statistical principle 
of sampling is used in business. For example, by the use of a 
small spectroscope, an entire trainload of pig iron can be tested. 
The spectroscopist opens the car door, fastens a wire to a sample 
pig, strikes an electric arc between this and a bar of pure iron he 
carries, and observes the light in the spectroscope. The bands of 
color in the spectroscope reveal to him whether or not the amount 
of impurity in the pig is below a previously determined standard. 
By properly selecting sample pigs at random the trainload of 
metal can be tested before it is unloaded.² In a similar manner,
though perhaps with less sensational instruments than the 
spectroscope, other types of more or less homogeneous or stand- 
ardized goods, such as shipments of ores, grains, oranges, potatoes, 
or lettuce, can be tested by sampling.
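
The random selection of sample pigs and their comparison with a previously determined standard can be sketched as follows. The impurity figures, the standard, the sample size, and the function name are all hypothetical; the sketch shows only the sampling step and does not represent any actual inspection procedure.

```python
import random


def inspect_lot(impurity_by_pig, sample_size, standard, seed=None):
    """Draw a simple random sample of pigs, without replacement,
    and report which sampled pigs fail the standard.

    A pig passes only if its impurity is below `standard`.
    Returns (sorted indices sampled, indices of sampled pigs failing).
    """
    rng = random.Random(seed)
    chosen = rng.sample(range(len(impurity_by_pig)), sample_size)
    failures = [i for i in chosen if impurity_by_pig[i] >= standard]
    return sorted(chosen), failures


# Hypothetical trainload: 190 pigs with an impurity fraction of 0.02
# and 10 pigs with 0.08, tested against a standard of 0.05.
lot = [0.02] * 190 + [0.08] * 10
chosen, failures = inspect_lot(lot, sample_size=20, standard=0.05, seed=1)
```

Only the sampled pigs are examined; what may be concluded about the whole trainload from the number of failures in the sample is a question for the theory of sampling.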

Education. The grading and selection of teachers have in some
instances been based upon intelligence tests, which have been
perfected by the use of statistical technique correlating test
grades with empirical results.³ The scientific use of intelligence

¹ For further illustrations see, for example, W. B. Dygert, Radio as an
Advertising Medium; H. E. Agnew and W. B. Dygert, Advertising Media;
E. R. Walter, Effective Marketing; E. H. Schell and F. F. Gilmore, Manual
for Executives and Foremen; and H. B. Maynard and G. J. Stegemerten,
Operation Analysis.

² Harrison, G. R., Atoms in Action (1939), p. 165. Cf., on sampling for
grading cars of iron ore, Stewart H. Holbrook, Iron Brew (1939), pp. 164-165.

³ West, Michael, "The Psychology of the Teacher," Journal of Educa-
tion, March, 1939, p. 158.

tests has developed since the First World War. In 1917 a test
called the American Army Intelligence Test was given to the 
drafted soldiers. The set of questions included on the Army
Intelligence Test was based upon the cumulative experience,
comparatively limited in extent, with such tests up to that time.
The war experience with the tests proved to be a landmark in 
their development in that it constituted a major experiment 
in their use and stimulated rapid development in the principles of 
their use.¹ Subsequently, the art of constructing questions for
testing intelligence, now called aptitude in order to contrast the 
testing of natural ability with the mere testing of acquired ability, 
has greatly progressed. In addition to the college-entrance tests, 
which in part measure opportunity, scholastic-aptitude tests are 
used by the leading universities as a basis for selecting students. 
As a consequence, statistical data that measure not only acquired 
intelligence but also native ability, or aptitude, are being accumu-
lated. The aptitude-test rating is often called the "intelligence
quotient," or simply I.Q.

Mental tests have most frequently been employed with the 
feeble-minded in connection with problems of detection and place- 
ment and for determining the type of training best suited to 
individual persons. Studies of criminals by the use of intelligence 
tests disclose relationships between intelligence and the type 
of crime committed, but apparently a high I.Q. neither prevents 
nor stimulates crime in general. Delinquent children have been 
found to exhibit more neurotic traits than do unselected school
children. Tests of emotional control, dishonesty, and lack of 
self-control have been found useful in forecasting incorrigibility
among delinquent children. 

Recently a study was made in which the I.Q.'s of 214 foster 
children, all of whom were adopted before the age of twelve 
months, and of 105 control children living with their own parents 
were compared with the I.Q.'s of the foster and real parents. The 
I.Q. of the parents was supplemented by information on occupa-
tional status and other pertinent data. Information regarding
the true parents of adopted children was secured from placement
records. There was far greater correspondence between the 

¹ Brigham, Carl C., "Two Studies in Mental Tests," Psychological
Review, Psychological Monographs, Vol. 24 (1917); A Study of American
Intelligence (1923); A Study of Error (1932).

I.Q.'s of foster children and their true parents than between
the I.Q.'s of foster children and their foster parents. It was
estimated by statistical techniques that the contribution of
heredity to individual differences in I.Q. is probably not far from
70 to 80 per cent and that the very best environment might, how-
ever, raise I.Q. as much as 20 points, while the poorest environ-
ment might lower it as much as 20 points.¹

Sociology. Modern sociology employs statistical method 
almost to the exclusion of other methods. This may be a mis- 
taken emphasis that will be corrected by future sociologists, 
but in that discipline the twentieth-century reaction to nine-
teenth-century abstraction was particularly great. Moreover,
this extreme emphasis upon fact by American sociologists² can
be traced to the picturesque Lester Frank Ward, who, despite 
the abstract qualities of his writing, emphasized the statistical 
approach. A farmer, a Civil War soldier, a Federal government 
official, a lawyer, a botanist, a chief of the Division of Navigation 
and Immigration, and, finally, toward the end of his life, a pro- 
fessor of sociology at Brown University, Ward came to the study 
of sociology with a richly varied experience. Among the voices 
raised against nineteenth-century emphasis on nature and the 
neglect of h\imanity his was the most vigorous. So eager was the 
reading world for this new approach that some of Ward's books 
were translated into every Continental tongue.^ 
Economic Theory. From Adam Smith to the present time
economic theory has been, at least in part, an inductive science.
In Adam Smith's day there were few statistics, but he made
extensive use of trade, price, and wage data in his analyses. In
modern times, especially since the turn of the century, more 

^ Study by Miss Burks, described in H. E. Garrett and M. R. Schneck, 
Psychological Tests, Methods, and Results (1933), pp. 189-190.

^ Lynd, R. S., and H. M. Lynd, Middletown: A Study in Contemporary
American Culture (1929); Middletown in Transition: A Study in Cultural
Conflicts (1937). These remarkable books are modern classics in sociology
and are based almost entirely upon observational method largely statistical
in character.

^ Chugerman, Samuel, Lester F. Ward, The American Aristotle (1939).
Because of Ward's optimistic views it has been suggested lately that he should
be widely read both for information and encouragement. Cf. a review of
Chugerman's book by Prof. Rudolph Binder in The New York Times Book
Review, Oct. 15, 1939, p. 10.
complete statistical data are available, and an increasingly
important volume of statistical writings that have significance 
in economic theory has been forthcoming. The theory of alloca- 
tion of income to the owners of capital, to the laborers, and to the 
enterprisers, as well as other comprehensive economic hypotheses 
such as business-cycle theories and theories of money and prices 
are today being tested by careful statistical studies.^ In addi- 
tion, theoretical questions concerning the factors determining par- 
ticular prices are being studied by the use of statistical methods. 

Mathematical Economics. In recent years the subject of
statistics has become closely related to the mathematical approach 
to economic theory. Starting with the nineteenth-century work 
of Cournot a group of mathematical economists have attempted to 
work out a purely abstract theory of economics by using the 
shorter and more precise methods of mathematical reasoning. 
The origin of this school of economists, often called the "mathe-
matical school of economists," was independent of the develop-
ment of statistics. As statistical methods became more refined
and economic data more plentiful and more accurate, the mathe- 
matical school of economists turned to statistics to derive demand 
curves and supply curves from the actual statistical events of the 
market place. This development during the 1930's was one of 
the most sensational and also one of the most controversial 
contributions to economics, and it continues to be energetically 
discussed in scientific meetings and journal articles by proponents 
and opponents of the methods used. Meanwhile, without waiting 
for the issue to be settled by the theoreticians, business enterprise 
and government, and notably the United States Department of 
Agriculture, are making extensive practical use of statistical 
demand and cost curves.^ 

Statistics and the Natural Sciences and Arts. Astronomy.
One of the foundations of statistical theory, the method of least 

^ Cf. National Bureau of Economic Research, Studies in Income and
Wealth, Vols. 1-3. National Resources Committee, The Structure of the
American Economy (1939), Part I, Basic Characteristics. Tinbergen, J.,
A Method and Its Application to Investment Activity (1939); Business Cycles
in the United States of America, 1919-1932 (League of Nations Economic 
Intelligence Service, 1939). 

* Ezekiel, M., and L. H. Bean, Economic Bases for Agricultural Adjust-
ment Act (1933); Schultz, Henry, Theory and Measurement of Demand
(1938).
squares, was first discovered and applied in astronomy early in the
nineteenth century. The method continues to be employed in
astronomy to trace the paths of stars, comets, planets, and other
heavenly bodies. Modern astronomy deals with large numbers 
of observations, which become the statistical raw material for the 
science. For example, the Harvard College Observatory receives 
monthly, from nearly one hundred different observers distrib-
uted the world over, and on report blanks containing seven
to seven hundred observations each, an average of forty-five 
hundred observations. It has been found best not to attempt to 
analyze each observer’s work separately, but instead to depend on 
multiplicity and frequency of observations well distributed 
throughout, to obtain the best possible light curves. Over fifty 
thousand observations come to the Harvard College Observatory 
each year, and from 1911 to 1939 it collected three-quarters of a 
million observations.^ 

For years the Smithsonian Institution has been using methods 
essentially statistical in nature to record measurements of the 
amount of heat received from the sun by the earth. Smithsonian 
stations in three of the most arid regions of the earth are daily 
recording the sun’s radiation. Observers in Chile, in South 
Africa, and in Western United States have been taking records. 
According to these observations, which have been made at widely 
separated stations, correlations exist between changes in solar 
radiation and temperatures on the earth. Study of these records, 
study of records of the earth’s weather as recorded in the growth 
rings of trees, and study of similar phenomena have revealed 
recurrent cycles in the weather that may be of great value in
foretelling long-range trends in the future succession of fat and
lean years.^

Zoology. A considerable amount of the experimental work in
the life sciences involves such quantitative considerations as
weights, measurements, enumerations, pointer readings of various 
kinds, comparisons, and classifications. If the results arrived at 
by experimentation are to give rise to general principles rather 

^ Cf. Campbell, Leon, "The Light Curve of SS Cygni, 213843," Annals
of Harvard College Observatory, Vol. 90, No. 3, pp. 93-162; Sterne, T. E., and
Leon Campbell, "Properties of the Light Curve of SS Cygni," ibid., Vol. 90,
No. 6, pp. 189-206.

* So says G. R. Harrison, op. cit., pp. 290-291.
than just to meaningless and incoherent single observations, the
zoological data must be consistently assembled, uniformity of
units must be observed, and the data classified. In other words,
statistical method must be used to bring order from isolated
chaotic measurements. 

In addition to routine problems of analysis in zoology, sta-
tistical and mathematical devices have had interesting applica-
tions in certain special problems. For example, in 1934 Zeuner
used a statistical study of a system of cranial angles as a basis for 
biological inferences regarding rhinoceroses; in 1930 Soergel 
emphasized the importance of statistical methods for certain 
paleontological problems, employing numerical and mathe- 
matical procedures to study footprints and from these drawing 
inferences regarding the animals that made them; and in 1912 
Ridgway attempted to put the study of faunal coloration on a 
statistical basis.^ Paleontologists use various mathematical
and graphic means to restore missing parts in fossil animals and 
to reconstruct hypothetical intermediate stages between less and 
more specialized animals. They also use statistical methods 
to study averages and variation in characteristics of different 
age groups, rate of growth, and the like, of various animals.^ 

Biology. Considering the modern emphasis on statistics in the
social sciences it is interesting to note, not only that the method of 
least squares was first applied in a natural science, but also that a 
second highly important statistical method was first developed in 
the natural science of biology. This is the statistical measure-
ment of correlation, which in the 1870's was used by Sir Francis
Galton to measure the effect that characteristics of midparents
(that is, the average of the two parents) had on their children.*

Biological experimentation in the nineteenth and twentieth
centuries, involving as it does rats, guinea pigs, and the like,
makes use of procedures that combine the laboratory test with 
the assembling of statistical data and their subsequent analysis. 
In this way, the effects and incidence of various diseases and 

^ Soergel, W., Die Bedeutung variationsstatistischer Untersuchungen für
die Säugetierpaläontologie, Band 63, pp. 349-450; Ridgway, R., Color
Standards and Color Nomenclature (1912). Also see Simpson, G. G., and
Anne Roe, Quantitative Zoology (1939), pp. 24, 404-406.

* Simpson and Roe, op. cit., p. 335.

• See Chap. XIII. 
of various cures for those diseases are measured; thus also are
tested the various theories regarding the relative importance
of hereditary factors as compared with environmental factors in
animal life. Much of this experimental work later becomes the
basis for theories regarding human life and for theories in respect 
to the effects of human diseases and their cure. 

Some problems in biology have interesting applications to
the homely arts of living. A recent illustration of the use of 
statistics in biology is the standardizing of liquid household 
insecticides, a matter of considerable importance to certain 
private enterprisers engaged in the business.^ By a series of 
experiments that established the sex ratio of houseflies statis- 
tically, hitherto unknown sources of variability in the effects 
of insecticides were thrown into bold relief. It was found, 
for example, that flies at ages of less than three days vary con- 
siderably in their reaction to the spray, while flies four to six 
days old exhibit a fairly constant susceptibility. It was known 
that male houseflies are markedly more susceptible to certain 
sprays than female houseflies. 

A recent book on heredity^ illustrates the extent to which 
biology depends upon statistical technique. Widespread interest
in the Dionnes led biologists to calculate the probability of
quintuplets as compared with the probability of twins. The
probability of quintuplets is 1/41,600,000, while that of twins
is [illegible]. In addition, the statistical method was used in an interest-
ing way to answer the question of heredity vs. environment,
epitomized in the highly talented musical family of Johann 
Sebastian Bach, a talent that ran through five generations.
Were the Bachs musical because of inborn talent or because 
of the musical environment in the home? To answer this 
question the author of the above-mentioned book resorted to 
statistical technique. He obtained information from 36 out- 
standing instrumental musicians, from 36 principals of the 
Metropolitan Opera Company, and from 50 students of the 
Juilliard Graduate School of Music. From facts obtained by

^ Campbell, F. L., G. W. Snedecor, and W. A. Simonton, "Biostatistical
Problems Involved in the Standardization of Liquid Household Insecti-
cides," Journal of the American Statistical Association, Vol. 34 (1939),
pp. 62-80.

® Scheinfeld, A., You and Heredity (1939).
questioning these persons, the author concludes that their
talent is largely inherited. Many will welcome this trend toward
basing studies of man upon statistics of human beings rather
than upon statistics of vegetables or fruit flies.

Medicine. Much of the statistical work in biology has
application in the field of medicine, and interest in statistics
on the part of the medical profession has increased. In addition,
the medical profession has become interested in statistics on
economic and social welfare, factors of importance in the control
of epidemics, and of certain types of disease in the modern com- 
munity.^ The practical advantages to the physician and to the 
sanitarian of the development of medical statistics are very 
great. Matters that were fiercely debated two generations ago 
and concerning which only few physicians of a hundred years ago 
could form an opinion are now a regular part of the knowledge 
of a junior medical student through the study of mortality 
statistics and vital statistics.^ Indeed, the medical profession
in England has recently contributed a textbook on medical
statistics designed to acquaint medical students with the funda-
mentals of statistical theory.®

Engineering. Since the success of their work depends not
only on the machines but on the human beings who operate them, 
mechanical engineers have become increasingly interested in the 
use of statistical method for making time studies in machine 
operations. It is now realized that such studies cannot be 
safely based upon some a priori scale of the machine's capacity 
or upon the record of only one or two operatives. Rather, 
time-study data must be collected from an entire group of 
operatives so that adjustments can be made according to the 
effects upon operation of the human traits found by statistical 
study to prevail in the machine or in the manner of operation.^ 

Two simple examples of the application of statistics to electrical 
engineering are the study of elevator capacity for buildings and 

1 Cf. Davis, Michael M., "Wanted: Research in the Economic and Social
Aspects of Medicine," The Milbank Memorial Fund Quarterly, October, 1935,
pp. 330-346.

* Cf. Pearl, Raymond, Introduction to Medical Biometry and Statistics
(1923), pp. 2, 38.

* Hill, A. B., Principles of Medical Statistics (1939).

* Bergen, H. B., "Scientific Management in Unionized Plants," Mechani-
cal Engineering, March, 1938, pp. 235-240.
telephone calls to be handled by an exchange. Statistics
regarding the number of passengers taken on at the first floor 
are used to determine the time required for passengers to leave 
the elevator, the round-trip time, and the number of passengers 
carried by a given elevator. The most desirable type of elevator 
equipment to install is determined from such data.^ 

Since engineers are dealing with natural phenomena that
cannot be affected by human bias, many of their problems can 
be solved approximately by the application of the principles 
of probability. For example, during a long period of gaugings of 
a stream the frequency of floods is often the best indication 
of probable future floods. Such important engineering data 
as forecasts of future floods, low annual rainfall, and consequent 
depletion of storage reservoir can be estimated by applying 
the theory of probabilities to statistics on the past history of 
such events. From such data, the use of statistical technique 
makes it possible to estimate the proper size of a hydroelectric 
power plant and to predict its output and earnings.^
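
The flood reasoning above can be illustrated with a minimal sketch. It assumes a record of annual peak gaugings, and every figure in it is hypothetical rather than taken from the handbook cited. The function names are illustrative only, and the assumption that successive years are independent is a simplification made for the sketch.

```python
def annual_exceedance_probability(annual_peaks, threshold):
    """Empirical share of recorded years whose peak flow exceeded the threshold."""
    if not annual_peaks:
        raise ValueError("need at least one year of record")
    exceedances = sum(1 for peak in annual_peaks if peak > threshold)
    return exceedances / len(annual_peaks)


def prob_at_least_one(p_annual, years):
    """Chance of one or more exceedances in `years` years, assuming independent years."""
    return 1.0 - (1.0 - p_annual) ** years


# Hypothetical 20-year record of annual peak flows (arbitrary units).
peaks = [310, 420, 280, 510, 390, 605, 330, 450, 295, 700,
         360, 415, 480, 520, 300, 340, 610, 275, 430, 385]

p_flood = annual_exceedance_probability(peaks, threshold=600)
p_decade = prob_at_least_one(p_flood, years=10)
print(f"annual probability: {p_flood:.2f}")
print(f"at least one flood in 10 years: {p_decade:.2f}")
```

A longer record narrows the uncertainty of the estimated frequency, which is why the length of the gauging period matters to the engineer.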

One of the most striking illustrations of the use of statistics 
in engineering is the control of the quality of manufactured 
products.^ In ordinary manufacture, with the exception of
the making of optical or other precision instruments of infinite 
refinement, all units of a product are not identical, in spite of 
the vaunted standardization of products in industry. The cost 
of so refining the machines or of so regulating their operation as 
to make all units of product identical would be prohibitive 
and in most cases unwarranted because of the low market value 
of the product. Variations in quality are thus considered to be
justified, and it is the purpose of quality control to develop 

^ Cook, H. B., "Selecting Elevators for an Office Building," Power, Mar. 8,
1932, pp. 404-408. 

* Creager, W. P., and J. D. Justin, Hydro-electric Handbook (1927), 
pp. 43 and 171. For other illustrations of the use of statistics in engineering 
science and art, see C. W. Hubbard, "Investigation of Errors of Pitot
Tubes," Transactions of the American Society of Mechanical Engineers,
August, 1939, pp. 477-506; H. K. Barrows, Water Power Engineering (1927), 
pp. 54-57. 

® Shewhart, W. A., Economic Control of Quality of Manufactured Product
(1931). Since Shewhart's pioneer efforts on this important subject, much 
progress has been made, so much that one might say a new craft has been 
created. 
statistical means of showing the actual statistics of variations
in quality, the economically permissible variations in quality,
and the statistical measurement of ways of locating and cor-
recting causes of quality variation beyond set limits. Such
control is designed to reduce the number of products that 
must be discarded as below standard; consequently, if successful, 
quality control reduces waste and lowers manufacturing costs 
per unit of output. In addition, selling costs are reduced and 
good will improved, because quality control decreases the 
number of customers who become dissatisfied as the result of the 
inconvenient necessity of returning inferior products. 
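
The idea of locating quality variation "beyond set limits" can be sketched as follows. The text does not say how the limits are set; the sketch assumes the conventional three-standard-deviation limits associated with Shewhart's control charts, and all measurements and function names are hypothetical.

```python
from statistics import mean, pstdev


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits from in-control baseline measurements.

    k = 3 is the conventional choice and is an assumption of this sketch.
    """
    center = mean(baseline)
    spread = pstdev(baseline)
    return center - k * spread, center, center + k * spread


def out_of_control(samples, lower, upper):
    """Positions of the samples that fall outside the limits."""
    return [i for i, x in enumerate(samples) if x < lower or x > upper]


# Hypothetical measurements of one dimension of a product (arbitrary units).
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0, 10.0]
lower, center, upper = control_limits(baseline)

new_units = [10.1, 9.9, 10.9, 10.0, 9.2]
flagged = out_of_control(new_units, lower, upper)
print(f"limits: {lower:.2f} to {upper:.2f}; flagged positions: {flagged}")
```

Units inside the limits are treated as showing only the permissible chance variation; those outside signal a cause worth locating and correcting.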

Although used in the American Telephone and Telegraph 
Company under the leadership of Dr. Shewhart, application 
of statistical quality control has been negligible in the United 
States.^ In Great Britain, however, the idea of statistical
quality control was accorded an enthusiastic reception following
Shewhart's visit to London in 1932. A committee headed by
Dr. E. S. Pearson was organized by interested British indus-
trialists to consolidate previous progress and facilitate adoption
of the technique.^ By 1937 in England the methods had been
applied to coal, coke, cotton yarns, cotton textiles, woolen
textiles, spectacle glass, lamps, building materials, and manu-
factured chemicals.*

Physics and Chemistry. There is no dispute among modern
physicists and chemists as to the importance of statistical 
methods in their sciences. Even the highly metaphysical Sir 
Arthur Stanley Eddington in his Nature of the Physical World 
(1928) attaches great importance to statistics in the natural 

^ Two reasons have been given for this failure of statistical quality control
to be applied in the United States: "[first,] a deep-seated conviction of
American production engineers that their principal function is so to improve
technical methods that no important quality variations remain, and that in
any case the laws of chance have no proper place among modern 'scientific'
production methods; second, . . . the difficulty of obtaining industrial
statisticians who are adequately trained in this fairly complicated field."
Freeman, H. A., "Statistical Methods for Quality Control," Mechanical
Engineering, Vol. 59 (1937), pp. 261-262.

* Pearson, E. S., The Application of Statistical Methods to Industrial 
Standardization and Quality Control (British Standards Institution, London, 
1936). 

* Freeman, op. cit., pp. 261-262.
sciences. In fact, he says that the laws of nature divide them-
selves into three classes: (1) identical laws, such as the law of
conservation and the law of gravitation; (2) statistical laws,
such as Boyle's law, the second law of thermodynamics, and
quantum laws; and (3) transcendental laws, which are genuine
laws of control in the physical world.^

In physics, statistical technique is employed in the study of 
molecules. This modern statistical approach has a philosophical 
background that goes back at least as far as Boltzmann, who 
in 1866 expressed the second law of thermodynamics in terms of 
probabilities. His contribution was regarded as a form of 
mysticism until it was demonstrated by research during the 
first two decades of the twentieth century.^ At the turn of the
twentieth century Max Planck was trying to explain why pieces 
of matter heated to high temperatures emit more light of one 
wave length than of any other and less light at both longer and
shorter wave lengths. He could not explain this phenomenon
except by supposing that light is emitted by atoms not as con- 
tinuous trains of electromagnetic waves but in discrete bundles 
of energy that he called "quanta." Similar experimental
work accompanied by new theoretical contributions, notably 
those of Heisenberg, led to the formulation of the modern 
statistical approach to the natural sciences. Within three 
decades this new theory has come into widespread practical 
use also, having found application in explanation of the behavior 
of photographic plates, the conduction of electricity through 
wires, the conduction of heat through walls, the behavior of 
photoelectric cells, the manner of emission and absorption of
light by atoms and molecules, and in the theory of metals.^ 

As explained by a recent popularizer of the natural sciences,^ 
Newtonian mechanics succeeds in accurately predicting motion 

^ Stebbing, L. S., Philosophy and the Physicists (1937), p. 70. 

^ Haas, Arthur, The New Physics (1923), pp. 38-44. 

® Cf. Eldridge, J. A., The Physical Basis of Things (1934), pp. 357-358.

* De Broglie, Louis, Matter and Light, The New Physics, translated by
W. H. Johnson (1939). For a popularized description of the experimental
development based upon Boltzmann and later Heisenberg's theories, see
also Eddington, op. cit. Cf. William M. Malisoff, review of De Broglie's
Matter and Light, in The New York Times Book Review, Oct. 1, 1939. Also
see H. Lifschutz and O. S. Duffendack, "The Counting Losses in Geiger-
Müller Counter Circuits and Recorders," Physical Review, Vol. 54 (Nov. 1,
occurring on the human scale and also on the scale of celestial
bodies. In other words, Newtonian mechanics does well for 
macroscopic measurements. But in the investigations of the 
motion of the microscopic particles inside the atom, Newtonian
mechanics ceases to have value, while quantum mechanics makes 
it possible to grasp the meaning of new principles that must 
necessarily be introduced in these more minute analyses. The 
principles referred to are statistical in nature and are based 
in large part on the theory of probability. "It is impossi-
ble to measure several physical quantities (as energy, position,
momentum) accurately at the same time. It is this neces-
sary inexactness that has forced us to find our ultimate laws in
probabilities."^

It must not be supposed that the new statistical approach, 
which is said to have been derived from Heisenberg’s uncertainty 
principle, necessarily has thrown into chaos concepts of physical 
measurement. The admission that laws in quantum mechanics 
are statistical may destroy the idea that the universe is a huge 
machine; but in a given case, with the initial conditions deter- 
mined as precisely as the principle of uncertainty permits, the 
probability of all subsequent states is determined by exact 
mathematical probabilities. There is nothing lawless in quantum 
phenomena.^ Analysis shows, moreover, that the theoretical 
uncertainty, which prohibits a simultaneous accurate measure- 
ment of position and of velocity, is noticeable only in dealing with 
the very minute masses of the subatomic world. With ordinary 
masses, the theoretical uncertainty, though still existing, falls 
below the practical uncertainties, which are due to the imperfec-
tion of human observations, and is completely submerged by the
latter. This gradual obliteration of the quantum uncertainties, 
as the scientific observer passes to the commonplace level of 
average masses, is the reason why Newtonian mechanics is still 
used. For the small velocities and relatively large masses with 


1938), pp. 714-726; A. Ruark, "The Time Distribution of So-called Random
Events," Physical Review, Vol. 56 (Dec. 1, 1939), pp. 1166-1167; E. R.
Rutherford, Radiations from Radioactive Substances (1930), Chap. VII,
pp. 171-172.

1 Eldridge, op. cit., p. 376. Cf. Tolman, R. C., The Principles of Sta-
tistical Mechanics (1938), p. 66.

* Stebbing, op. cit., p. 183.
which the scientist is usually concerned, the two mechanics yield
results so nearly alike that in practice no experiment would be 
sufficiently refined to detect the difference. Since the Newtonian 
mechanics is mathematically the simpler, there is every advantage 
in retaining it.^
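
The fading of the quantum uncertainty at ordinary masses can be made concrete with a short worked estimate. The text does not state the relation itself; what follows assumes the standard modern form of Heisenberg's inequality, and the masses and position uncertainties chosen are illustrative only.

```latex
% Uncertainty relation, and its form for velocity when p = m v
\Delta x \, \Delta p \ge \frac{\hbar}{2}
\quad\Longrightarrow\quad
\Delta x \, \Delta v \ge \frac{\hbar}{2m},
\qquad \hbar \approx 1.05 \times 10^{-34}\ \mathrm{J\,s}

% Electron, m \approx 9.11 \times 10^{-31} kg, located to within an atom's width
\Delta x = 10^{-10}\ \mathrm{m}
\;\Rightarrow\;
\Delta v \ge \frac{1.05 \times 10^{-34}}{2 \,(9.11 \times 10^{-31})(10^{-10})}
\approx 5.8 \times 10^{5}\ \mathrm{m/s}

% One-gram body, m = 10^{-3} kg, located to within a thousandth of a millimeter
\Delta x = 10^{-6}\ \mathrm{m}
\;\Rightarrow\;
\Delta v \ge \frac{1.05 \times 10^{-34}}{2 \,(10^{-3})(10^{-6})}
\approx 5.3 \times 10^{-26}\ \mathrm{m/s}
```

On these assumed figures the minimum velocity uncertainty of the one-gram body is far below anything an instrument could detect, while that of the electron is very large.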

Except for Einstein's theory of relativity there has been nothing
so to stir the imagination of the natural scientists in the twentieth 
century as this new statistical approach. In fact, one writer has 
said that the entire structure of modern physics and chemistry, 
and therefore of all the natural sciences to which they are funda- 
mental, rests upon quantum mechanics.^ 

From the above discussion it is readily apparent that statistical 
techniques are helpful, not only to theories in the natural and 
social sciences, but to the arts dependent on those sciences. Yet 
for many students the most important reason for knowing some- 
thing about the fundamentals of statistical method is the need for 
intelligent discrimination between the proper and improper use of 
statistics. Unfortunately, a large portion of the extensive 
modern employment of statistics in all fields falls under the latter 
heading. This is especially true in popular presentations of 
modern scientific and political matters. Too close attention to 
the mechanics of a method and the neglect of common sense are 
responsible for a large number of these horrible examples. All too 
often, preoccupation with the technique dims common sense. 

Statistics and Philosophy. Nineteenth-century cocksureness 
of the scientific approach, pretending to such a degree of precision 
and to such broad scope as to annihilate the foundations for
ethical, moral, and religious faiths, has largely disappeared.
Under the aegis of the assertive and materialistic science of the 
nineteenth century, belief in free will was dwindling to a mere 
superstition; but the element of indeterminacy brought into 
science as a result of the application of the theory of probabilities 
again permits freedom. This decline of mechanistic assurance in 
science has not been ignored by philosophic thought, which has 
emphasized as never before a lesson that has often recurred in the 
history of philosophy: objective reality is not always identical 
with subjective concepts.® Eddington expresses these doubts 

^ D'Abro, A., The Decline of Mechanism (1939), pp. 37-57.

* Harrison, op. cit., pp. 341-342.

® Eldridge, op. cit., pp. 379-380.
in the following words: "Does the spectroscope find the colors,
or does it make them? When the late Lord Rutherford showed
us the atomic nucleus, did he find it or did he make it? How
much do we discover and how much do we manufacture by our
experiments?"^

Just as surely as the railroad destroyed the supremacy of the 
stagecoach or the radio eclipsed the popularity of the phonograph, 
so have the discoveries of modern science eclipsed faith in many
ideals and beliefs that served to give reason to the lives of the 
masses of the people. In the realm of ethical and moral values, 
buttressed by the dogma of a bygone age, nineteenth-century 
scientific method was almost wholly destructive and hardly at all 
constructive. Modern philosophy criticized scientific method, 
both the laboratory and statistical branches, for failing to provide 
new moral values to replace outmoded prescientific ones. Despite 
this gloomy aspect, philosophy’s greatest spokesmen look to 
scientific method itself to obtain the necessary enlargement 
of the conception of human nature and the formulation of the 
required new moral values. John Dewey envisages the use of 
scientific method to create a comprehensive democratic culture as 
a guarantor of genuine freedom.^ 

SUMMARY 

Statistical method — the quantitative expression of knowledge, 
the marshaling of facts and their arrangement in a form suitable 
for scrutiny — has been the means employed by businessmen, 
natural scientists, and social scientists to establish bases for 
judgments regarding factual data so complex or so numerous as 
to be, in the unmarshaled state, intellectually incomprehensible. 
Commercial statistics and their interpretation may, indeed, be 
said to constitute the scientific background of business today. 
Men cannot conduct their business intelligently without them. 
Quite as important as statistics of commerce and trade are the 
more recently developed industrial and social statistics, data on 
employment and pay rolls in industry, trade, and finance and on
the distribution of income. 

In the science of government and its practical art the sta- 
tistical approach has proved itself essential as facts have accumu- 

^ Op. cit., pp. 108-109.

* Freedom and Culture (1939), passim. 
lated and, to an increasing extent, as means have been developed
for making the quantitative units of measurement required. 
The importance of statistical techniques in the natural sciences is 
attested to by the definition of “science” so familiar to every 
schoolboy: Science is systematized knowledge. Statistical 
mechanics is essential to an understanding of modern physics 
and chemistry. Whatever the individual’s station or calling, he 
is to a greater or lesser extent using statistical techniques. 



CHAPTER II 


GATHERING STATISTICS 

Before the commercial revolution of the sixteenth century,
social and economic life was relatively simple. The small villages 
and towns were self-sufficing economic and social units. Little or 
no statistical enumeration of facts was required to comprehend 
the extent of the population, the number of buildings, the number 
of cattle, and the quantity of other constituent units in the com- 
munity. Within the limited range of space and time usually 
contemplated, events having to do with the welfare or distress of 
the community were not complex. Judged by modern standards, 
government was simple and inexpensive because social and 
economic relationships were not complicated. Even the great 
cities of the time were not large compared with modern metro- 
politan districts. In population, wealth, and trade, the extent 
of a sixteenth-century nation was inconsiderable and, furthermore, 
was growing almost imperceptibly. In other words, conditions 
were relatively simple and static. 

Genesis of Fact Marshaling. Under such conditions, little 
was done in the way of the systematic gathering and analysis 
of statistical data. The situation did not demand the con-
tinuous assembling of up-to-the-minute facts. Indeed, it was
not profitable to do so. The motive did not exist in sufficient
force to direct attention to the problem of expressing quanti- 
tatively the events of contemporary social and economic life; 
and the facts of the natural sciences were obscured in medieval 
mysticism or cherished from a forgotten age by a few scattered 
and scholarly churchmen. Nevertheless, it was found useful 
on occasions to make great surveys that could subsequently
serve as the basis of governmental decision in regard to taxation 
and other social activities, and that might also be a guide to 
private enterprise. Pepin the Short in 758 and Charlemagne 
in 762 demanded detailed descriptions of church lands, while 
several works written in France during the first half of the ninth 
century gave a partial enumeration of the serfs attached to the
land.^ Likewise, when William the Conqueror undertook the
reorganization of the national government of England in the
eleventh century, he found it desirable to make his famous 
survey, which resulted in Domesday Book, completed about 
1086.^ Also, as early as the fourteenth century, the medieval 
guilds gathered statistics in connection with their regulation of 
markets.® Later, in the fifteenth century, as the breakup of 
the medieval system gathered momentum and as the rise of 
trading groups accelerated, there was a great increase in the 
amount of statistical work done by guilds as well as by central 
governments — the latter not infrequently through guild organiza- 
tions or through the Church. Economic statistics were collected 
when the occasion demanded, for example, when the upsetting 
of a customary price by a flood or drought required explanation 
and the determination of a new customary price. Although 
there is evidence that in these several ways statistics were 
assembled, they were neither methodically made nor preserved. 
There are isolated instances of the registration of deaths or 
baptisms in the fourteenth and fifteenth centuries, but it was 
not until the sixteenth century that any considerable movement 
toward statistical enumeration of facts occurred. 

Development of a Dynamic Social Order. During the Renais- 
sance, from thirteenth-century Italy to fifteenth-century Spain 
and England, the quantity of data in the physical sciences 
steadily accumulated from experimental efforts of astronomers 
and other scientists. The most dramatic of all human experi- 
ments was made by the voyagers seeking to prove that the world 
is round. The discovery of America and the voyages of exploration
of the sixteenth century gave great impetus to the development
of trade and the growth of nations.⁴ Motivated by the
economic ideals of mercantilism, a period of trade development 
followed, the domestic system of manufacture rapidly expanded, 

¹ Walker, Helen M., Studies in the History of Statistical Method (1929),
pp. 32-33. The History of Statistics (compiled and edited by John Koren,
1918).

² Cheyney, Edward P., An Introduction to the Industrial and Social
History of England (1925), pp. 17-18.

³ Faure, Fernand, “The Development and Progress of Statistics in
France,” in Koren, History of Statistics, pp. 229-233.

⁴ Faulkner, H. U., American Economic History (1931), pp. 34-53.

and the colonial empires of Portugal, Spain, France, Holland,
and then England, emerged. The change was from haphazard 
and occasional trade of the merchant adventurers of the sixteenth 
century to the more or less systematic and regular international 
and intercolonial trade of the seventeenth and eighteenth cen- 
turies. Along with this trade development came the necessity 
for obtaining more regular information concerning markets, 
population, wealth, prices, and the movements of merchandise 
and gold. Furthermore, with this growth of trade both in 
volume and in complexity, governmental and social organization 
became more complex. 

As the fact of change was revealed by the events of the com- 
mercial revolution, national governments began to feel the 
need of more regular fact finding in order to visualize and to 
interpret changing conditions. Yet it must not be supposed 
that well-organized or any considerable amount of statistical
data for the sixteenth or even the seventeenth century can now 
be found. It was more a case of an awakening of the will to do 
rather than a case of actual accomplishment. For it was really 
the Industrial Revolution and the vigorous growth that took 
place in the eighteenth and nineteenth centuries that gave the 
actual impetus to systematic marshaling of quantitative facts. 
It was not until the early part of the nineteenth century, indeed, 
that most of the essential principles of statistical method, even 
for purely descriptive purposes, had been evolved. Also, the 
compilation and current use of statistics as practiced today 
have been made possible only by the growth in transportation 
and communication facilities, a nineteenth-century phenomenon. 
It was also from the eighteenth century onward that the achieve- 
ments of scientists in accumulating experimental data for the 
natural sciences fired the imagination of scholars to solve the 
problem of data accumulation for the sciences of life and social 
behavior. 

Quantitative Expression of Facts. Where there are large
populations, great nations of tens of millions of people, all 
problems of social, economic, and political organization are 
increased many times in complexity and, furthermore, new 
problems arise. The problem of feeding such large populations, 
the problem of housing them, the problem of keeping them 
employed and preventing them from harming each other, to
mention but a few of the considerations confronting the governmental
administrator, all these are vastly complex owing to the
great expanse of geographical space covered and the varying 
conditions at different places and times. In simple economies 
many of these problems can be solved by permitting individual 
freedom of choice and free economic enterprise; but as the 
community becomes more and more closely knit in economic 
and social relations and as various forms of economic power 
emerge, individual freedom of choice and free economic enterprise
become goals that must be consciously sought by organization
rather than natural tendencies that develop unaided.¹

In the intricate social and economic organizations of the 
modern era, it is inconceivable that any individual or group of 
individuals can obtain the knowledge necessary to form judg- 
ments concerning the issues that arise. An individual can 
comprehend only those conditions within a reasonable geo- 
graphical area about him; the more complicated society is, the 
smaller the area about him that he can understand without the 
use of statistics. “We are overwhelmed, not only by the diversity
of knowledge, but also by the diversity of possible deeds, of
possible values, and of possible judgments,” and, further, “this
human mind, whose needs Plato so perfectly understood, still
insists upon constructing for itself a fixed world in the midst
of a fluid one. It persists in thinking in terms of aims and
ends and perfections; of ideals, of purposes, and final goods;
and, at the very last, it insists upon assuming some direction
in change, something toward which the chain of events is
moving.”²

In this effort it is impossible for the individual to survey the 
conditions qualitatively — it would take him many human life- 
times to inspect the whole population, and the capacity of the 
human brain is not adequate to the task of absorbing so complex 
an impression. If he attempts a microscopic survey, he is 
quickly smothered by overwhelming detail. If he attempts a 
macroscopic survey without the use of statistics, he is compelled 

¹ For the complexity of modern society, as it is reflected in statistics, see
publications of the United States Bureau of the Census. For suggestive
special studies, see Corrington Gill, Wasted Manpower (1939); Henry Pratt
Fairchild, People (1939).

² Krutch, J. W., Art and Experience (1932), pp. 121, 211.

to resort to guesswork and commonly originates “cloud-pushing”
fantasies. Furthermore, the individual's personality tends
to bias him not only in his observations but also in his judgments. 
If he is temperamentally inclined to be impressed by sordid 
things, he is likely to notice them more than the good things in 
his surroundings, and his judgment is correspondingly pessi- 
mistic. On the other hand, if he is temperamentally optimistic, 
he tends to consider as the rule the good things and to regard 
as unusual the sordid things of life. Where it is necessary to 
gain knowledge concerning large populations of people and 
things, where social and economic life is complex, there is need 
to use statistics. 

Rational Basis for Gathering Data. Quixotically accumulating
data is not to be confused with scientific fact gathering. The 
progressive accumulation of useful quantitative facts has been 
stimulated and furthered by a definite, conscious purpose. 
To look at the process historically, it was the rise of nationalism 
and the mercantilistic ideal that supplied definite purpose for 
the fact-finding inquiries of the eighteenth-century political 
arithmeticians. Modern survivals of the same nationalistic 
and mercantilistic ideals impel governments to spend vast 
sums for the collection of statistics designed to measure the 
wealth and material position of the nation and to furnish business 
enterprise with facts about markets. Underlying much of this 
effort abides also a sincere interest, stimulated by scientific 
research, in real human welfare. As a consequence, the modern 
census attempts to collect quantitative facts directly or indirectly 
concerning the health and morals of the nation's people. The 
subsequent usefulness of such statistical data depends upon 
how well the simple rules of common sense have been followed 
in assembling and in presenting the data. 

Units of Description and Measurement. The units of description
or measurement by means of which quantitative facts
are to be assembled must be carefully defined. When defined, 
such units must be strictly applied during the assembling of the
data and in all subsequent analysis. It is accordingly of the 
utmost importance clearly and fully to describe units of descrip- 
tion and measurement in all subsequent use of the data. Such 
rules are so clear and so easily resolved into matters of simple 
common sense that it seems almost a waste of time to direct
attention to them; yet to follow them is not always so simple a
matter as might be supposed. 

For example, in 1940 thousands of enumerators undertook 
the task of counting the population of the United States, of 
counting the number of farms, farm animals, and all other types 
of wealth, and of obtaining specified information concerning 
every person living in the United States. One may ask: Why 
should mere counting be a complicated task? This question 
would be quickly answered by the farmer's boy who has just
finished trying to count the number of chickens in his pens. 
Everything would be easy if they would only stand still. People, 
as well as chickens, do not stand still while they are being 
counted, and simple matters mount up into a veritable host of 
intricate difficulties. Suppose you were an enumerator and 
in the first house approached you found that the mother is in 
the maternity hospital, a baby was born at 10 a.m. of the census 
day, one son is away at college in another state, a daughter is 
boarding and rooming in a neighboring town, where she teaches 
school, and the father is in jail for evading income taxes. On 
several points you would feel the need for very specific instruc- 
tions. To avoid double counting or the failure to count many 
individuals, instructions to the enumerators must be given with 
great care; every possible complication must be foreseen. 

In recording facts about manufacturing and trading, or 
merchandising, enterprises in separate categories, when is an 
enterprise a manufacturing concern and when is it a trading, 
or merchandising, concern? In recording statistics about farms, 
when is a farm not a farm but a truck garden? These few 
examples are probably sufficient to emphasize the point that the 
unit must be carefully defined and that the defined unit must be 
strictly followed and freely or even religiously disclosed to all 
who in the future use the statistics. 

Carefully planned schedules of questions, often called “questionnaires,”
are the principal means of gathering statistics.
These vary from schedules simple enough for oral presentation, 
as frequently utilized in polls, to the elaborate forms used by the 
government or research organizations. In the first phase of 
statistical investigation, the gathering of facts, care in following 
all the rules of common sense and logical definition is epitomized 
in the formulation of the questionnaire, or schedule.

QUESTIONNAIRES OR SCHEDULES

Official Example of Care in Description of Units. In taking the 
United States census, for example, the assurance of accuracy in
regard to these important but detailed matters is guarded by a 
skillfully organized system. Forms are supplied, with column 
arrangement for writing in all the required information. A 
question appears at the head of each column, and the columns, 
and therefore the questions, are grouped into subjects; thus in the 
schedule for the 1940 population census there are 34 columns 
grouped under the subjects location, household data, name, rela- 
tion, personal description, education, place of birth, citizenship, 
residence, and employment status. In addition, columns 35 to 50 
contain supplementary questions to be asked of only a sample of all
the persons enumerated. Figures 1 to 3 show in three sections 
the 34 questions asked of all persons. Figure 4 is included to 
show the questions on employment status in the 1930 census, 
revealing thereby the great elaboration of this type of question in 
the 1940 census. 

Sample forms that had been filled in with illustrative answers 
were supplied to enumerators, and a complete, simple description 
of the manner in which the form was to be filled out was printed on 
the sample schedule. Pamphlets were issued for the use of 
enumerators, giving detailed instructions. For the 1940 census 
there was issued to enumerators taking the census of population
and agriculture a printed and indexed pamphlet 173 pages long. 
This gave detailed definitions and described procedure for enu- 
merators to follow under the various circumstances that might 
arise in their house-to-house canvasses. 

Moreover, the enumerators worked directly under experienced 
district supervisors, who, in turn, were under area managers
responsible to the Bureau of the Census in Washington. To train 
the 529 district supervisors in the 1940 census taking and census 
procedures, a picked group of more than one hundred men from all 
parts of the country were given a special course of instructions in 
Washington. Those who passed the examination were sent out as
area managers to the 104 census areas, each to direct the training 
of five to seven district supervisors and to act as regional manager 
between them and the Bureau of the Census in Washington.

The 529 districts were broken into enumeration districts of 
which there were about 147,000. Generally speaking, there was

Fig. 1. The first 12 questions on the 1940 census schedule grouped under topics location, household data,
name, relation, and personal description. Notice the sample entries.

Fig. 2. Questions 13 to 20 on the 1940 census schedule grouped under topics education, place of birth,
citizenship, and residence.

Fig. 3. Questions 21 to 34 on the 1940 census schedule grouped under topic employment status of person fourteen years old and
over, with subgrouping indicated in captions of columns. Notice the sample entries.

Fig. 4. Questions in the 1930 census schedule on employment status under topics occupation, industry, and
employment. In the 1940 census schedule these questions were placed among the supplementary ones.

one enumerator in each of these districts, but in certain regions
one enumerator covered more than one district. Therefore, 
about 123,000 enumerators were used. Wide publicity for such 
careful preparation in the case of the 1940 census resulted from
Congressional protests about some of the new questions.¹

To illustrate the necessity for careful definition of units and 
description of procedure and to solve the census problem of the 
amazing family described above, the following is quoted from the 
Instructions to Enumerators:²

Who Is to Be Enumerated in Your District

300. The problem of who is to be enumerated in your district is 
extremely important. Therefore, study very carefully the following 
rules and instructions. 

301. The Census Day. There should be a return on the population 
schedule for each person alive at the beginning of the census day, i.e., 
12:01 a.m. on Apr. 1, 1940. Thus, persons who died after 12:01 a.m.
should be enumerated; and infants born after 12:01 a.m. on Apr. 1, 1940,
should not be enumerated.

302. Usual Place of Residence. Enumerate every person at his 
usual place of residence. This means, usually, the place that he
would name in reply to the question “Where do you live?” or the place
that he regards as his home. As a rule, it will be the place where the 
person usually sleeps. 


Persons to Be Enumerated in Your District 

304. Enumerate all men, women, and children (including infants) 
whose usual place of residence is in your district or who, if temporarily in
your district, have no usual place of residence elsewhere. Persons who 
move into your district after Apr. 1, 1940, for permanent residence 
should be enumerated by you, unless you find that they have already 
been enumerated in the district from which they came. 

305. Residents Absent at Time of Enumeration. Some persons having 
their usual place of residence in your district may be temporarily 
absent from the household at the time of the enumeration. These you 
must enumerate with the other members of the household, obtaining 
the information regarding them from their families, relatives, acquaint- 
ances, or other persons able to give it. However, do not include with 

¹ The New York Times, Feb. 27-29, Mar. 1-3, 1940.

² Bureau of the Census, Instructions to Enumerators, Population and
Agriculture, pp. 14-18, 80-81.

the household a son or daughter permanently located elsewhere or
regularly employed elsewhere and not sleeping at home. 

306. Persons to be counted as members of the household include the
following:

a. Members of the household temporarily absent at the time of the 
enumeration, either in foreign countries or elsewhere in the United 
States, on business or visiting. 

b. Members of the household attending schools or colleges located in
other districts, except student nurses away from home and students in 
the Naval Academy at Annapolis or in the Military Academy at West 
Point or in any other training school or institution operated by the 
War or the Navy Departments or the United States Coast Guard. 

c. Members of the household who are in a hospital or a sanitarium 
but who are expected to return in a short period of time. 

d. Servants or other employees who live with the household or sleep
in the same dwelling. 

e. Boarders or lodgers who sleep in the house.

f. Members of the household enrolled in the Civilian Conservation
Corps. 

307. In the great majority of cases the names of absent members will 
not be given to you by the persons furnishing the information, unless 
particular attention is called to them. Before finishing the enumera- 
tion of a household, therefore, you should ask the question: Are there 
any members of the household who are absent? 


Persons Not to Be Enumerated in Your District 

313. There will be a certain number of persons present, and perhaps 
lodging and sleeping in your district at the time of the enumeration, who 
do not have their usual place of residence there. As a rule, do not 
enumerate as residents of your district any of the following classes, 
except as provided in paragraph 314: 

a. Persons temporarily visiting with the household. If, however,
they do not have any usual place of residence from which they will be 
reported, they should be enumerated with the household. 

b. Households temporarily in your district which have a usual place
of residence elsewhere from which they will be reported. 

c. Transient boarders or lodgers who have some other usual or perma- 
nent place of residence, that is, who have a house or apartment else- 
where in which they usually reside and where they will be enumerated. 

d. Persons from abroad temporarily visiting or traveling in the United 
States and foreign persons employed in the diplomatic or consular service
of their country (see paragraph 331). (Enumerate other persons
from abroad who are students in this country or who are employed here,
however, even though they do not expect to remain here permanently.) 

e. Students or children living or boarding with this household in 
order to attend some school, college, or other educational institution in
the locality but who have a usual place of residence elsewhere from which
they will be reported. 

f. Persons who take their meals with the household but usually lodge
or sleep elsewhere. 

g. Servants or other persons employed by the household but not 
sleeping in the same dwelling.

h. Persons who were formerly members of this household but have 
since become inmates of a jail or a mental institution, home for the
aged, infirm, or needy, reformatory, prison, or any other institution in 
which the inmates may remain for long periods of time. 

i. Transient patients of hospitals or sanitariums. Such patients are 
to be enumerated as residents in the households of which they are mem- 
bers and not as residents in the institution, unless they have no other 
place of residence at which they will be reported. 

314. When to Make Exceptions. In deciding when to make exceptions
to the rules indicated above, consider whether the household or
persons temporarily residing in your district will be reported at another 
place of residence by some person in a position to supply the information 
required. If the persons or household will not be so reported, enumerate 
them as residents of your district. 

Enumeration of Special Classes of Persons 

315. You may experience some difficulty in determining whether to 
enumerate certain special classes of persons indicated below. In any 
instance in which you are not sure whether to include persons as resi- 
dents of your district, ask your squad leader or supervisor for further 
instructions. 

316. Servants. Enumerate with the household any servants, laborers,
or other employees who live with the household and sleep in the same 
house or dwelling unit. However, enumerate servants who sleep in 
separate and completely detached dwellings as separate households, 
even though the dwelling is on land owned by members of the household 
by which the servants are employed. 

. . . . .

318. Students at School or College. If there is a school, college, or
other educational institution in your district that has students from 
outside your district, enumerate as residents of the school only those 
students who have no usual places of residence elsewhere. Especially
in a university or professional school there will be a considerable number
of the older students who are not members of any household located
elsewhere. Find and enumerate all such persons. 

319. Schoolteachers. Enumerate teachers in a school or college at 
the place where they live while engaged in teaching, even though they
may spend the summer vacation at their parents' home or elsewhere.

320. Student Nurses. Enumerate student nurses as residents of the 
hospital, nurses' home, or other place in which they live while they are
receiving their training. 

321. Patients in Hospitals, Sanitariums, and Convalescent Homes.
Most patients in hospitals, sanitariums, and convalescent homes are 
there temporarily and have some other usual place of residence. Enu- 
merate patients as residents of such an institution only if they have no 
other place of residence from which they will be reported. A list of 
persons having no permanent homes can usually be obtained from the 
institution records. 

322. Inmates of Prisons, Asylums, and Institutions Other than Hospi- 
tals. Your district may include a prison, reformatory, or jail, a home 
for orphans, for aged, infirm, or needy persons, for blind, deaf, or incurable
persons, a soldiers' home, an asylum or hospital for the insane or the
feeble-minded, or a similar institution in which the inmates usually 
remain for long periods of time. Enumerate all the inmates of such 
institutions at the institutions. Note that in the case of jails you must
enumerate the prisoners there, however short the sentence. 


Census of Agriculture
General Information 

Purpose of the Census of Agriculture. An act of Congress provides 
that a census of agriculture be taken every 5 years, for the purpose of 
obtaining basic information on farm acreage, land values, crops, live- 
stock, and other general items relating to agriculture. The Sixteenth 
Census, which will be taken as of Apr. 1, 1940, will include compre- 
hensive information on agriculture, including irrigation and drainage of 
farm land. 

Every enumerator must fill out a farm and ranch schedule for each 
tract of land in his enumeration district that might classify as a “farm”
under the census classification, giving all the requested information. 
The information should be obtained by a personal visit of the enumer- 
ator. It is absolutely necessary that the census be complete and accu- 
rate. Census data are widely used by both private and public agencies 
and often form the basis for legislative and administrative programs.
The farmer should be made to feel that his contribution to the census
is of real value to himself and to his community. 

Census Schedules Are Confidential. The Federal law providing for the
census prescribes heavy penalties for revealing information to unauthor- 
ized persons. The enumerator should make it clear, in dealing with 
persons who seem unwilling to give the information requested, that he 
is not allowed to give any information to their neighbors or other 
persons; that only sworn census employees will have access to the farm 
schedules; and that those records for individual farms cannot be used 
for purposes of taxation, regulation, or investigation. 


Definition of a Farm. The definition of a farm found on the face of
the schedule must be carefully studied by the enumerator. Note that 
for tracts of land of 3 acres or more the $250 limitation for value of 
agricultural products does not apply. Such tracts, however, must have 
had some agricultural operations performed in 1939 or contemplated in 
1940. A schedule must be prepared for each farm, ranch, or other 
establishment that meets the requirements set up in the definition. A 
schedule must be filled out for all tracts of land on which some agricultural
operations were performed in 1939 or are contemplated in 1940
and which might possibly meet the minimum requirement of a “farm.” 
When in doubt, always make out a schedule. 
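
The acreage and value tests quoted above can be sketched as a small screening function. This is only an illustration, not the Bureau's procedure: the function name and argument names are hypothetical, and the rule for tracts under 3 acres (at least $250 of agricultural products) is an assumption inferred from the quoted statement that the $250 limitation does not apply at 3 acres or more. The full definition printed on the schedule is not reproduced in the text.

```python
def meets_minimum_farm_test(acres: float,
                            product_value_1939: float,
                            has_agricultural_operations: bool) -> bool:
    """Rough screen for whether a tract meets the minimum 'farm' test."""
    # Quoted rule: operations must have been performed in 1939
    # or be contemplated in 1940.
    if not has_agricultural_operations:
        return False
    # Quoted rule: for tracts of 3 acres or more the $250 limitation
    # on value of agricultural products does not apply.
    if acres >= 3:
        return True
    # Assumption (not stated outright in the text): tracts under 3 acres
    # qualify only with $250 or more of agricultural products.
    return product_value_1939 >= 250


# Hypothetical tracts: (acres, value of products in 1939, operations?)
tracts = {
    "large tract, little output": (40.0, 100.0, True),
    "small truck garden": (2.0, 300.0, True),
    "small idle lot": (2.0, 0.0, False),
    "small garden, low output": (1.5, 80.0, True),
}

for label, (acres, value, operations) in tracts.items():
    print(label, meets_minimum_farm_test(acres, value, operations))
```

A clear-cut function of this kind covers only the unambiguous cases; the quoted instruction still tells the enumerator to make out a schedule whenever there is doubt.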


You now have instructions that will help enumerate the interesting
family first encountered above: the mother will be enumerated
(paragraph 321), the baby will not be enumerated
(paragraph 301), the son will be enumerated (paragraph 318), the
daughter will not be enumerated (paragraph 319), and the father 
will not be enumerated at the household, although if the jail is in 
town he will there be enumerated (paragraph 322). 
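
The way the quoted paragraphs settle each member of this family can be sketched as a short lookup. The status labels, the class name, and every birth date except the baby's are hypothetical, introduced only for illustration; the paragraph numbers in the comments are those of the quoted instructions.

```python
from dataclasses import dataclass
from datetime import datetime

# Census moment from paragraph 301: 12:01 a.m. on Apr. 1, 1940.
CENSUS_MOMENT = datetime(1940, 4, 1, 0, 1)

# Hypothetical status labels mapped to the place of enumeration
# that the quoted paragraphs prescribe.
PLACE_BY_STATUS = {
    "at_home": "household",
    "temporary_hospital_patient": "household",  # paragraphs 306(c), 321
    "student_away_at_college": "household",     # paragraphs 306(b), 318
    "teacher_living_elsewhere": "elsewhere",    # paragraph 319
    "jail_inmate": "elsewhere",                 # paragraphs 313(h), 322
}


@dataclass
class Person:
    name: str
    born: datetime
    status: str


def place_of_enumeration(person: Person) -> str:
    """Return 'household', 'elsewhere', or 'not enumerated'."""
    # Paragraph 301: infants born after the census moment are left out.
    if person.born > CENSUS_MOMENT:
        return "not enumerated"
    return PLACE_BY_STATUS[person.status]


# Only the baby's birth time (10 a.m. on the census day) comes from the
# text; the other dates are invented for the illustration.
family = [
    Person("mother", datetime(1900, 5, 1), "temporary_hospital_patient"),
    Person("baby", datetime(1940, 4, 1, 10, 0), "at_home"),
    Person("son", datetime(1920, 3, 15), "student_away_at_college"),
    Person("daughter", datetime(1917, 8, 2), "teacher_living_elsewhere"),
    Person("father", datetime(1895, 11, 20), "jail_inmate"),
]

results = {p.name: place_of_enumeration(p) for p in family}
for name, place in results.items():
    print(f"{name}: {place}")
```

Here "elsewhere" means that the person is counted by whichever enumerator covers the place where he or she lives or is confined, not that the person is omitted from the census.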

Figures 5 and 6 are photographic reproductions of parts of the 
Farm and Ranch Schedule used for the census of agriculture. On 
Fig. 6 appears the definition of a farm to which reference is made 
in the general information quoted above from the manual of 
instructions. Altogether the farm and ranch schedule contains 
232 questions on 16 subjects. The subjects include information 
about the operator, farm acreage, values, farm mortgage and 
taxes, irrigation, cooperative selling and purchasing, farm labor, 
farm expenditures in 1939, farm machinery and facilities, live-

Fig. 6. Another portion of the Farm and Ranch Schedule to illustrate the type of question included and the definition of a farm.

Manufacturers' directories are examined and books and periodicals
relating to the industry are read, thus obtaining the essential
a priori background of knowledge to form the basis for sound
proportional sampling.¹

To obtain the data on wages and hours of labor, the Bureau of 
Labor Statistics uses carefully prepared and elaborate question- 
naires, one of which is illustrated in Figs. 7 and 8. Trained 
agents obtain the information from a responsible official of each 
local union. Each scale of wages and hours is verified by the 
union official interviewed and is further checked by comparison 
with the written agreements when copies are available. For 
example, in the building-trades survey for June 1, 1939, inter- 
views were obtained with 1,551 union representatives, and 2,729 
quotations of scales were received. The union membership 
covered by these contractual scales of wages and hours was 
approximately 444,000. Great care is exercised to see that the 
agents are adequately trained to collect the data; written instruc- 
tions are supplied them by the Bureau of Labor Statistics, 
in which they are cautioned as follows:²

In the final analysis the accuracy and value of the entire survey must 
rest upon the agents who collect the data. These data must be abso- 
lutely correct and so presented on the schedule as not to be confusing 
or ambiguous. Each agent is, therefore, requested to study thoroughly 
the instructions, not once but repeatedly, and to question any point 
therein which may not be perfectly clear. It is extremely important 
that the agent check every schedule carefully before mailing to the office 
to be sure that each item is correctly entered and explained. When 
agreements accompany the schedules, the agent must compare each
quotation with the provisions in the agreement and must explain any 
differences. 

In order to ensure the collection of comparable data from all 
agents, the instructions give painstaking definitions of “union
scale,” “collective agreement,” “apprentices” and “foremen,”

¹ For further details of the methods employed by the Bureau of Labor
Statistics, see “Methods of Procuring and Computing Statistical Information
of the Bureau of Labor Statistics (1923),” Bulletin 326; also “Union
Scale of Wages and Hours in the Building Trades, June 1, 1939,” Serial
R1034, from Monthly Labor Review, November, 1939.

* Bureau of Labor Statistics, “Instructions for Survey of Union Scales of 
Wages and Hours, 1939** (No. 7468), p. 1. 
12. Remarks or amplification of data reported on front of schedule 

13. Have there been any changes in rates since June 1, 1939, or have any changes been agreed upon to become 
effective in the near future? 

14. Agreements are negotiated with: 
(a) Individual employers 
(b) Employers' association(s) 

If (b), what proportion of union firms participate or are represented in the association(s)? 

For Building Trades Only 

15. Does union cooperate with employer in establishing or enforcing standards of fair competition? 
Yes □ No □ If there is a written code of fair competition please attach a copy. If oral or customary 
arrangements, please explain 

16. What proportion of the 1- and 2-family dwellings in this community are being built under union conditions 
so far as this trade is concerned? 

Does the union have a special rate for this type of work? Yes □ No □ 

Fig. 8. Second portion of questionnaire used by the Bureau of Labor Statistics 
to obtain data on union wages and hours. 
Fig. 9. Questionnaire used for consumers’ purchases study. 
“union rates” and “actual rates,” “union rates” and “prevailing 
rates,” and “averages.” 

Study of Family Income and Expenditures. In 1929 the Social 
Science Research Council suggested the advantages of conducting 
a study of consumption in such a way that the sample would cover 
a wide range of incomes, all types of natural families, and all 
occupations within representative communities of different 
sizes. Income data and certain other facts would be collected 
from all families visited, through the use of a short schedule. 
These data would provide the basis for selection of an adequate 
number of families in each income class to furnish more careful 
estimates of income and the details of expenditures. Following 
these suggestions, the National Resources Committee and the 
Bureau of Home Economics of the United States Department of 
Agriculture completed in 1939 a study of family income and 
expenditures. Figure 9 shows the questionnaire used.^ Tables 
of data based upon this questionnaire are shown in Chap. IV. 
It may be noticed that the type of question and indeed the whole 
schedule are much less complex, involving much simpler units, 
than any thus far illustrated. It was necessary for this schedule 
to be simpler than those discussed above because for the con- 
sumer-income study the agents were not so well trained as are, for 
example, the regularly employed field agents of the Bureau of 
Labor Statistics. 

Mailed Questionnaires. In some cases, especially where the 
schedule of questions is comparatively simple, questionnaires 
are sent through the mail to the sources of information. Such a 
method may be used either where the units involved are very 
simple or where those who are filling out the questionnaires are 
known to be qualified to do so. The United States Bureau of the 
Census and the Bureau of Labor Statistics have been able to 
use this method to obtain certain types of information from 
manufacturing concerns regarding employment, pay rolls, manu- 
facturing output, labor turnover, and the like. The method 
appears to be most used where fairly simple facts are collected at 
regular intervals. Data on pay rolls and employment are 

^ Bureau of Home Economics, U.S. Department of Agriculture, “Con- 
sumer Purchases Study,” Part I, Family Income, Miscellaneous Publication 
339, pp. 338-339; cf. National Resources Committee, Consumers’ Incomes in 
the United States (1938), p. 49. 
obtained by mailed questionnaires monthly by the Bureau of 
Labor Statistics from representative manufacturing establish- 
ments in 90 manufacturing industries.1 Figure 10 is an illustra- 
tion of the type of letter used by such agencies to secure the good 
will and cooperation of businessmen.2 

Where the questionnaire-by-mail method is used, the returns 
must be carefully edited and subsequent correspondence is 
frequently required to correct mistakes made on the returns. 
Manufacturing and merchandising concerns in this country have 
become trained in the matter of filling out questionnaires for the 
government through years of practice so that there has been built 
up a cooperative enterprise between the government and business 
in the gathering of business statistics. Although sometimes feel- 
ing the heavy burden of filling out numerous forms of this type, 
business is nevertheless glad to cooperate because it is eager to 
see each month the compilation of business data that emanates 
from government sources. 

Income-tax returns are of the nature of questionnaires and 
are a source of many important statistics. Everyone is familiar 
with the care necessary in the examination of the units involved; 
everyone who has had to handle a return or listen to the head of 
the family talk about it knows how detailed and specific are the 
printed instructions accompanying each form on which the return 
is made. In the case of the income-tax return, which frequently 
becomes so complicated as to require legal advice and expert 
accountants, the penalty for failure to file a return is sufficient 
to supply any incentive needed to overcome all obstacles. For 
failure to supply information for the other types of questionnaire 
that have been discussed, with the exception of the census, there 
is no similar penalty; the business concerns fill out such ques- 
tionnaires in a spirit of public service and to obtain the resulting 
compilations of data. 

Rules for Constructing Questionnaires. Any investigator who 
is tempted to seek information by the questionnaire method 
will be well advised to spend considerable effort first, to make 
certain that the facts are not already available, and then to 

1 Bureau of Labor Statistics, “Employment and Pay Rolls,” Serial R1052, 
November, 1939, pp. 7, 11, and 16. 

2 This letter was used in January, 1940, with a new questionnaire revised 
to obtain better monthly data on labor turnover. 
U.S. DEPARTMENT OF LABOR 
BUREAU OF LABOR STATISTICS 
WASHINGTON 

Dear Sir: 

In response to numerous requests for more detailed informa- 
tion regarding labor turn-over in manufacturing industries, and in order 
that the monthly reports published may become of greater value to you 
and others who cooperate with us, the Bureau of Labor Statistics has 
revised the form used in collecting turn-over data. 

I am sure you will agree that one of the most important re- 
visions of this form is the separation of accessions into two groups: 
the first, to show the number of workers rehired after a separation of 
three months or less; the second, to include all other workers hired. 
The purpose of this breakdown, of course, is to determine whether or 
not accessions occur in connection with new job opportunities or 
whether they are the result of temporary suspensions. 

We have also provided space to report separately changes in 
clerical, sales and supervisory personnel, so as to permit tabulations 
for the turn-over of these employees. If it is difficult or impossible 
for you to report this information separately, we shall appreciate the 
data either for the total of all employees or, if this too is not 
feasible, for plant employees only. 

We are enclosing copies of the revised forms for January. I 
want to assure you that the data which you furnish will be kept strictly 
confidential and will be used in such a way as not to reveal the identity 
of any individual firm. 

We sincerely hope that our labor turn-over reports based on 
this new procedure will be more useful and valuable to you, and we 
shall greatly appreciate your continued cooperation. 

Very truly yours, 

Isador Lubin, 

Commissioner of Labor Statistics 

Enclosures 

(8678) 

Fig. 10. A typical letter from the Bureau of Labor Statistics seeking to secure 
the good will and cooperation of businessmen in the reporting of statistics. 
investigate well the pitfalls of questionnaire making, which is 
a highly specialized art. There are six fundamental but simple 
rules to be followed: 

1. The interest of the recipients of the questionnaires must be 
aroused or their cooperation obtained through some means. 
This may be done by engaging the support of some organization 
with which the individual informants are associated. For exam- 
ple, if the questionnaire is to go to bankers, the support or endorse- 
ment of the American Bankers Association should be enlisted. 
Interest in the questionnaire may also be aroused by the promise 
to furnish free copies of the summarized information when com- 
piled. In this manner and by the promise of secrecy regarding 
individual returns, various governmental units obtain great 
quantities of statistical information. 

2. The questionnaire should be as short as possible, consistent 
with the scope of information sought; and the individual ques- 
tions should be so formulated as to be free of all ambiguity. 
They should be simple. Avoid presenting “problems” that will 
puzzle the recipients of questionnaires. 

3. Where possible, arrange the individual questions so that 
replies can be brief and unequivocal. “Yes” or “no” or perhaps 
merely a check mark is the ideal answer. 

4. The letter transmitting the questionnaire should be 
brief and dignified and yet should “sell” the idea to the 
informants. 

5. After all is prepared, try out the questionnaire along with 
the transmittal letter on a dozen or so of the potential question- 
naire recipients in order to make final revisions before printing 
the questionnaires, or schedules. 

6. Always include a self-addressed stamped return envelope. 

The first five rules apply whether the questionnaire is to be 
used by trained enumerators or to be sent by mail, but special 
care must be exercised if sent by mail. Study of Fig. 9 will 
reveal that answers to all questions are quite simple, in some 
cases merely a check mark (see questions VI, 1, 2, 4, and VII), 
in other cases the entry of a familiar numerical item. Less 
highly trained enumerators are required for handling such a 
questionnaire than are required for handling the United States 
census schedules. 
EDITING 

When the questionnaire is received from the agent or from the 
respondent by mail, it must be examined. If any statement on 
the schedule conflicts with other statements or if the schedule 
is incomplete or lacks clearness, it may have to be returned to 
the agent or respondent for explanation or revision. This is 
called “editing” the returns or the schedules. In any case, a 
certain amount of editing must always be done before tabulation 
of the data is begun. When trained visiting enumerators have 
been used in the survey, there will, of course, be a minimum 
of mistakes. When the questionnaires have been filled out by 
the informant directly, it may be necessary to write for further 
information or for corrections because of inadvertent mistakes 
in replies. If the respondents have been interested sufficiently 
to return the questionnaire with answers filled in, they will 
probably be willing to answer further simple questions to eluci- 
date their former replies. If it is believed that the information 
has been deliberately falsified or withheld, it may be necessary 
to discard the entire schedule or at least the replies in it that 
seem to be of doubtful truth. 

Editing the schedules is the process of preparing the original 
statements in the schedule for classification, coding, and tabula- 
tion. Careful editing is necessary in order to obtain compilations 
of data that will truly reflect the conditions being investigated. 
One task of editing is to see that all figures entered on the return 
are clear. If not, the editor rewrites the figures. If so poorly 
written that even the editor cannot read them, the schedule 
must be abandoned or the information obtained by further 
correspondence. If the editing is done locally, many of these 
difficulties may be eliminated by telephoning. 

The principal task in editing is to locate all incomplete, incon- 
sistent, or improbable and impossible answers. When such 
answers are found, it is necessary either to discard the defective 
schedules or to obtain correct replies through further inquiry. 
This does not, of course, imply the elimination of “unexpected” 
replies. An answer is incomplete, for example, if pneumonia 
is given as the cause of death, for it is necessary to know 
whether it is bronchial or lobar pneumonia. An answer is 
inconsistent, for example, if a return shows a person widowed 
when from his age it is clear that he never could have been 
married. If a male is reported as having died of a 
disease that is known to occur only in females, this is an “impos- 
sible” answer. The line between improbable and simply 
unexpected replies is somewhat less distinct. 

Only after incomplete, inconsistent, or improbable and impos- 
sible replies have been completed or corrected and all unclear 
figures carefully clarified are the schedules ready for coding, 
classification, and tabulation. For elaborate undertakings like 
the census, instructions are printed not only for the guidance 
of the enumerators but also for the editing and coding of the 
returns. For example, it is pointed out that the examination 
for completeness and consistency should be made family by 
family and not line by line. It will be easier to follow the entries 
belonging to the family if a strip of cardboard is placed across 
the schedule just under the line containing the entries for the 
last member of the family.^ The coding and editing instructions 
say that all corrections and code figures entered on the schedule 
by the coding clerks should be made with red ink and a medium- 
point pen (neither a stub nor an extremely fine pen). Such a 
detailed instruction as this is necessary in order to secure uni- 
formity and when tabulation is undertaken will enormously 
facilitate the work of the card-punching operators. 
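
The three kinds of defective answer described above can be pictured as simple checks applied to each schedule entry. The following is only a minimal sketch in Python: the field names, the minimum age of marriage, and the sample list of causes of death are assumptions made for illustration and are not taken from any census manual.

```python
# Hypothetical field names and thresholds, chosen only for illustration.
MIN_MARRIAGE_AGE = 14                      # assumed cutoff, not from the text
FEMALE_ONLY_CAUSES = {"puerperal fever"}   # assumed sample entry
UNSPECIFIED_CAUSES = {"pneumonia"}         # must say "bronchial" or "lobar"


def edit_schedule(record):
    """Return a list of (kind, message) flags for one schedule entry."""
    flags = []
    cause = record.get("cause_of_death")
    if cause is not None:
        cause = cause.strip().lower()

    # Incomplete: a required item is blank, or the cause of death is too vague.
    for field in ("age", "sex", "marital_status"):
        if record.get(field) in (None, ""):
            flags.append(("incomplete", field + " is missing"))
    if cause in UNSPECIFIED_CAUSES:
        flags.append(("incomplete", "type of " + cause + " not stated"))

    # Inconsistent: widowed at an age when marriage could not have occurred.
    age = record.get("age")
    if (record.get("marital_status") == "widowed"
            and isinstance(age, int) and age < MIN_MARRIAGE_AGE):
        flags.append(("inconsistent", "widowed at age " + str(age)))

    # Impossible: a male reported dead of a disease found only in females.
    if record.get("sex") == "male" and cause in FEMALE_ONLY_CAUSES:
        flags.append(("impossible", "male with cause " + cause))

    return flags


if __name__ == "__main__":
    entry = {"age": 6, "sex": "male", "marital_status": "widowed",
             "cause_of_death": "Pneumonia"}
    for kind, message in edit_schedule(entry):
        print(kind, "-", message)
```

A flagged schedule is not corrected automatically; as in the text, it goes back to the agent or respondent, or is set aside. Merely unexpected replies are deliberately left unflagged.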

CODING 

Whether or not machine tabulation is used, the coding of the 
schedules is a measure for economizing time. When large 
amounts of data are involved, consistent classification is enor- 
mously simplified by the use of code numbers. In arranging 
data it is then necessary only to observe a code number con- 
spicuously and uniformly placed on the return instead of reading 
a title and remembering to what class that title belongs. On 
a Works Progress Administration project to construct indexes of 
manufacturing employment and pay rolls in the state of New 
Jersey, 1923-1940, it was not possible to obtain the use of tabu- 
lating machines. It was found necessary, nevertheless, to use a 
carefully worked out coding procedure to avoid hopeless con- 
fusion in the handling of the data, which came monthly from 

^ Cf. United States Bureau of the Census, Instruction Manuals on Coding, 
passim. 
several hundred reporting firms. When machine tabulation is 
used, the coding procedure is a necessary step; it will be noticed 
that on the schedules (see Figs. 1 to 8) columns are inserted for 
the code numbers or letters to represent the various types of 
information on the schedule. 

An Illustration of Coding. In the 1939 census of manufac- 
tures, the manufacturing industries in the United States were 
grouped into 20 groups, each with a number. Food and kindred 
products constitute group 1; its code number is 100. Lumber 
and timber basic products form group 5; its code number is 500. 
Chemicals and allied products are group 9; its code number is 900. 
All subgroups of industries in the food and kindred products 
classification have code numbers in the 100’s; for example, 
beverages are numbered in the 180’s: nonalcoholic beverages 
is 181, malt liquors 182, wines 184, and so on. Grain-mill 
products are numbered in the 140’s: flour and other grain-mill 
products is 141, cereal preparations 143, rice cleaning and 
polishing 144, and so on. Confectionery and related products 
are numbered in the 170’s: chocolate and cocoa products is 172, 
chewing gum 173, and so on. Similarly, subgroups of industries 
in the chemicals and allied products classification have code 
numbers in the 900’s; for example, industrial-chemical industries 
are numbered in the 980’s: plastic materials is 982, explosives 
983, coal-tar products, crude and intermediate, 981, and so on.1 
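
The way a single code number carries both the group and the subgroup may be made concrete with a short sketch in Python. The titles and numbers below are those quoted above. The rule that the hundreds part gives the group and the tens decade gives the subgroup is an assumption that fits the quoted examples; the lookup tables are deliberately partial, and the function name `decode` is hypothetical.

```python
# Partial lookup tables built only from the examples quoted in the text.
GROUPS = {
    1: "Food and kindred products",
    5: "Lumber and timber basic products",
    9: "Chemicals and allied products",
}
SUBGROUPS = {
    140: "Grain-mill products",
    170: "Confectionery and related products",
    180: "Beverages",
    980: "Industrial-chemical industries",
}
INDUSTRIES = {
    141: "Flour and other grain-mill products",
    143: "Cereal preparations",
    144: "Rice cleaning and polishing",
    172: "Chocolate and cocoa products",
    173: "Chewing gum",
    181: "Nonalcoholic beverages",
    182: "Malt liquors",
    184: "Wines",
    981: "Coal-tar products, crude and intermediate",
    982: "Plastic materials",
    983: "Explosives",
}


def decode(code):
    """Split an industry code into group, subgroup, and industry titles.

    Assumes the hundreds part is the group and the tens decade is the
    subgroup; titles not in the partial tables come back as None.
    """
    group_number = code // 100
    decade = (code // 10) * 10
    return {
        "group": GROUPS.get(group_number),
        "subgroup": SUBGROUPS.get(decade),
        "industry": INDUSTRIES.get(code),
    }


if __name__ == "__main__":
    print(decode(182))
    print(decode(981))
```

The clerk, like the function, never needs to read the title on the return; the number alone places the establishment in its class.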

The classifications adopted by the United States Bureau of the 
Census for the 1939 census of manufactures follow closely the 
suggestions made by the Technical Subcommittee on Industrial 
Classification composed of representatives of various govern- 
ment agencies.2 The suggested classification of this subcom- 
mittee, designated the Standard Industrial Classification Code, 
was made according to the following principles:3 

1. The classification should conform to the existing structure 
of American industry. 

1 United States Bureau of the Census, “Industry Classifications for the 
Census of Manufactures, 1939,” Form 75. 

2 Members of the subcommittee included representatives of the Depart- 
ment of Labor and Industry of New York State, the Federal Social Security 
Board, the Bureau of Internal Revenue, the Bureau of Labor Statistics, the 
Bureau of the Census, the United States Employment Service, and the 
Central Statistical Board. 

3 Central Statistical Board, May 10, 1938. 
2. The reporting units to be classified are establishments. 
(An establishment is defined as a place of business. All persons 
working at the same location or place of business are classified 
in the same industry.) 

3. Each establishment is to be classified according to its 
major activity. 

4. Each industry group established must have significance 
from the standpoint of the number of establishments and 
employees involved, volume of business, employment and pay- 
roll fluctuations, and other important economic features. 

TABULATION 

When the schedules have been edited and coded they are 
ready for the operations of the card-punch machines, and the 
final machine tabulations are made from these punched cards. 
The information on each schedule is transferred in code to the 
punch cards. With a machine resembling a toy typewriter, 
operators punch holes or combinations of holes in the cards so 
that the electrically operated machinery for sorting and tabulating 
can automatically transfer the information to totals by any 
classification desired. The punch card somewhat resembles 
the music roll of an old-time player piano, and most of the 
operations through which it goes are mechanical and electrical. 
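
What the sorting and tabulating machinery does with the coded cards can be imitated in a few lines of Python. This is a sketch only: each "card" is represented as a small dictionary, the employee counts are invented for illustration, and the function name `tabulate` is hypothetical. The industry codes are those of the 1939 census of manufactures quoted earlier.

```python
from collections import defaultdict


def tabulate(cards, classify, field):
    """Total one numeric field of the cards under any classification.

    `classify` maps a card to the class it belongs to, playing the part
    of the sorter; the running totals play the part of the tabulator.
    """
    totals = defaultdict(int)
    for card in cards:
        totals[classify(card)] += card[field]
    return dict(totals)


# Invented sample data: one card per reporting establishment.
cards = [
    {"industry": 181, "employees": 40},
    {"industry": 182, "employees": 120},
    {"industry": 141, "employees": 75},
    {"industry": 982, "employees": 60},
]

# The same cards yield totals by group (hundreds) or by subgroup (tens decade).
by_group = tabulate(cards, lambda c: c["industry"] // 100, "employees")
by_subgroup = tabulate(cards, lambda c: c["industry"] // 10 * 10, "employees")

if __name__ == "__main__":
    print(by_group)
    print(by_subgroup)
```

Once the information is in code, changing the classification means changing only the sorting rule, never the cards themselves.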

The 1930 census required the punching of 326,635,219 cards, 
which required an additional handling for verification. These 
cards represented 2,000,000 pounds of paper and would make a 
belt reaching nearly twice around the world at the equator. 
Punching, tabulating, and related work were equivalent to the 
handling of 4,701,671,697 cards once. 
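
The scale of these figures is easier to grasp as averages. The short Python computation below uses only the three numbers quoted above; the variable names are chosen for illustration, and the averages are rough, since the text gives no breakdown of the handlings.

```python
cards_punched = 326_635_219            # cards punched for the 1930 census
single_handlings = 4_701_671_697       # all work expressed as one-card handlings
paper_pounds = 2_000_000               # weight of the cards

# Average number of times each card passed through a machine or a clerk's hands.
handlings_per_card = single_handlings / cards_punched

# Average weight of a thousand cards, in pounds.
pounds_per_thousand = paper_pounds / cards_punched * 1000

if __name__ == "__main__":
    print(round(handlings_per_card, 1), "handlings per card")
    print(round(pounds_per_thousand, 1), "pounds per thousand cards")
```

On these figures each card was handled somewhat more than fourteen times on the average.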

The Bureau of the Census has its own unit tabulating equip- 
ment. Some of these machines can digest 400 cards a minute. 
The unit machines were invented and developed within the 
Bureau by Herman Hollerith, who was employed in the Bureau 
and invented the first machine to tabulate the 1890 census. 
He is now known as the “father of machine tabulation,” a method used 
throughout the world by governments and business to handle 
large statistical jobs. 
SOURCES OF STATISTICS 

Primary and Secondary Sources. The original collector of 
data is their primary source. Generally speaking, data obtained 
from a primary source inspire greater confidence than the same 
data taken from a secondary source. The primary source is 
presumably the one sure place to find the exact definition of the 
units of observation involved. Subsequent reproductions of 
the data may fail to reproduce this essential information and 
lead to a misunderstanding of the true meaning of the data. 

The United States Bureau of the Census is the primary source 
of population data, of census data in general, and of all the 
statistical data published by the United States Department of 
Commerce, for the Bureau of the Census is the data-gathering 
agency of the Department. The Bureau of Foreign and Domestic 
Commerce, on the other hand, is a large retailer of statistical 
data gathered not only from the records of the Bureau of the 
Census but also from numerous other governmental and non- 
governmental sources. While governmental publications are 
thus not uniformly primary sources, they are usually very 
careful to give exact reference to the primary sources and to 
define units adequately. 

In some cases, secondary sources may be better than the 
primary. Such is the case when experts presumably better 
qualified than the general run of statistical researchers have 
selected the good statistics from the poor ones in some primary 
source that may be either obscure and difficult to obtain or of a 
highly technical nature. Occasionally a secondary source 
performs the valuable function of selecting data impartially 
from primary sources that are biased in one way or another. 
Sometimes it is necessary also to be on guard against bias in 
government sources.1 

1 Hinrichs, A. F., “Statistical Bias in Primary Data and Public Policy,” 
Journal of the American Statistical Association, Vol. 33 (1938), pp. 143-152. 
Natural Sciences. After the development of the statistical 
theories of gases (Charles’s law, Boyle’s law, Avogadro’s law, 
the work of Gay-Lussac, and the like) the physical sciences and 
arts accumulated source materials of a statistical character. 
Beginning with the last quarter of the nineteenth century, 
biology and zoology also accumulated source materials of a 
statistical character when a group of English biologists concluded 
that mass observation was necessary for the successful solution 
of their problems.1 

Nongovernmental Sources. Statistical data of the natural 
sciences consist to a large extent of hypothetico-observational 
or experimental data. The principal sources of these data are 
handbooks of the special fields of study and monographs written 
by scholars at the great centers of research. For example, 
sources of astronomical data are the observatories located in 
various places throughout the world. The sources of currently 
discovered data in biology, physics, and chemistry are the 
laboratories maintained by universities, by private business 
enterprisers, or by such institutions as the Smithsonian Institu- 
tion at Washington, D.C. 

Additional primary sources of statistics in the natural sciences 
are the several hundred technical journals, publications of the 
learned societies, trade journals, publications of commercial 
research organizations, college bulletins, and the publications of 
endowed research enterprises. Fortunately for those who desire 
to make use of them, the data currently accumulated in such 
sources are summarized or abstracted in publications that main- 
tain sections of their respective issues for the purpose.2 

Statistical data for the natural sciences are also found in 
handbooks for the numerous special fields of study. For example, 
there are handbooks in medical entomology, physical therapy, 
geology, botany, experimental physics, and geophysics.3 

1 Anderson, O. N., “Statistical Method,” Encyclopaedia of the Social 
Sciences. 

2 A partial list of such abstracting agencies is as follows: Science Abstracts, 
Abstracts of Geology, Abstracts of Bacteriology, Abstracts of Chemical 
Papers, Zentralblatt für Mathematik, Jahrbuch über die Fortschritte der 
Mathematik, Physikalische Berichte, and Biometrika. 

3 Handbook of Physical Therapy (1939); Handbuch der allgemeinen Chemie, 
unter Mitwirkung vieler Fachleute (1918-1937); Handbuch der Experimental- 
physik (1926-1935), 43 vols.; Handbook for Chemistry and Physics. 
Governmental Sources. Sources of data in the natural sciences 
are enormously supplemented by governmental agencies. The 
government weather bureau supplies current and historical 
data important to many kinds of research in such natural sciences 
as botany, zoology, and geology. The Minerals Yearbook, 
published by the United States Department of the Interior, is a 
source of data for natural scientists. The Geological Survey 
is a source not only of geological data but of data on electrical 
power production and other information useful to engineers. 
Engineers also find that government agencies are sources of 
statistics on railroads, flood control, roads, and other similar 
subjects having to do with construction. 

Biologists find the chief source of modern vital statistics 
of all sorts among the publications of governmental agencies. 
An important source of statistical data for medical men results 
from medical research recorded in the files of hospitals, some of 
which are governmentally operated. 

The quantity of statistical data relating directly to the natural 
sciences is thus large, but the natural sciences in addition make 
extensive use of the highly organized mass of statistical data 
collected largely by social scientists. Scholars in the natural 
sciences frequently make use of statistics concerning social and 
economic events. It is not at all uncommon for data concerning 
the behavior of human beings to enter into the calculations of 
engineers, physicists, and chemists engaged in practical business 
enterprise or pure research. Some illustrations were given in 
Chap. I. 

Social Sciences. Genesis of Statistical Sources. The increasing 
complexity of economic and social life has furnished the motive 
for the systematic marshaling of statistical data about human 
society; and, in addition, the dynamic quality of modern life 
makes it necessary to repeat statistical enumeration frequently in 
order to have knowledge of current facts and, what may be 
more important, knowledge of change. In the static conditions 
of earlier times one public fact-gathering enterprise could serve 
for years as a basis for judgments and for political and social 
action. Under modern dynamic conditions, this is not the case. 

In a democracy the timing of governmental action is dependent 
on the consent of the people, and that requires widespread 
knowledge of many economic and social facts and their inter- 
pretation. If democracy is to preserve its high standards of 
achievement, its powers of expression in the face of tremendous 
forces that appeal to sentiment rather than to reasoned judgment, 
adequate factual information must be in the hands of the voters 
and of their governmental administrators and representatives 
in time for necessary action. Modern business enterprisers, 
too, faced with rapidly changing conditions, must lean more 
and more heavily on statistics to point the way to the solution 
of their problems. 

During a great national crisis, such as a severe depression or a 
war, the value of statistical data is enormously enhanced. In 
depression periods, published statistical data from governmental 
sources, which in retrospect appear to have been but a trickle in 
prosperity, swell to flood proportions. Modern war, moreover, 
as well as being a “war of supply,” a “war of machines,” or a “war 
of production,” is a “war of statistics.” The fact that much 
of the increasing wartime volume of statistical data is confidential 
explains the apparent and deceptive appearance of fewer sta- 
tistics in wartime than in peacetime. During the Second World 
War the statistics published by the United States Bureau of the 
Census, for example, sharply decreased because its organization 
and equipment were almost fully employed doing wartime sta- 
tistical work, especially for such agencies as the War Production 
Board and the Office of Price Administration. 

So diligent have been the efforts to obtain current knowledge by 
means of statistics during the past fifty years that a vast source of 
raw material now exists, covering many fields of knowledge. 
Elementary acquaintance with these sources is essential to all 
those who hope to work in either the natural or the social sciences. 
Complete familiarity with sources of statistics can come only with 
long practice in their use. It would be futile to attempt to impart 
to the student this desirable familiarity by giving a complete 
description of all sources. 

The Pattern of Statistical Sources. The student cannot hope to 
memorize the names of all sources of statistics; indeed, the 
attempt would not be useful, for the names change and new ones 
are added as time goes by. Comprehension of the pattern of 
development of statistical sources, however, will enable the stu- 
dent to become a scholar who, when confronted by a statistical 
problem, will have acquired a “statistical sense” that will guide 
him to the appropriate sources. This presumption explains why 
the present chapter on sources is given an historical or a genetic 
setting. Let the names of all the statistical agencies be changed 
by the Second World War, and the study of the historical and 
genetic explanation of statistical sources will still help the student 
acquire that scholarly ability required to locate sources; he will 
have historical perspective to facilitate his prompt understanding 
of the postwar world of statistical sources. In any case, the 
period between the First and the Second World War will long 
continue to be one intensively studied by statisticians of coming 
generations. 

In the ensuing description of sources of statistics, which is 
presented in its historical or genetic aspects, governmental 
sources are given more space than nongovernmental sources, 
because the general statistician deals mostly with the former. 
While the specialized statistician must acquire detailed knowledge 
of sources in his special field, he also needs to be familiar with 
governmental sources in his field. Governmental sources, more- 
over, are themselves one of the best guides to the successful use of 
nongovernmental sources, because many governmental agencies 
are secondary sources that give complete and very useful descrip- 
tions of the primary sources used. 

The motive underlying the gathering and publication of 
statistics by private enterprise has usually been the profit 
available through the sale of such statistical information to 
commercial, banking, and manufacturing or distributing enter- 
prises. In many instances these services emerged as incidental 
features of existing publications; an example is the increasing 
amount of statistics of all kinds published in newspapers and 
periodicals. In other instances the statistical feature was the 
original purpose of the publication; many trade journals are 
cases in point. 

The state and privately endowed universities of the nation are 
important sources of statistical research, especially of a pioneering 
character, in all branches of knowledge — some being famous for 
certain fields of statistical work. 

During recent years one of the most striking aspects of “big 
business” development has been the maintenance of research 
organizations contributing to statistical knowledge, a fact that the 
public was not permitted to forget as it visited the 1939 World’s 
Fair in New York and read newspaper and magazine wartime 
advertisements in the early 1940’s. Some corporation-financed 
research organizations, primarily intended for profit making, have 
incidentally contributed in important ways to the advancement of 
scientific statistics in engineering, business, and the use of agri- 
cultural products. Most of the pioneering statistical research 
in agriculture, however, as well as in labor organization, wages, 
and the like, is done by governmental units or by the govern- 
mentally sponsored agricultural experiment stations connected 
with various state colleges or universities. 

The motive underlying governmental activity in the collection 
and publication of statistics has been to increase knowledge of 
facts so that administrators may adjust government action to the 
changing needs of a dynamic society, so that democratic
representatives of the people may legislate more expeditiously and
wisely, and so that the voters in a democracy may have the
opportunity to know the facts. In recent years, a great
expansion in the governmental activity of collecting and publishing
statistics to aid business enterprise has occurred. In short,
governmental statistical agencies assist both public and private
economic planning. The large quantities of statistical
information released by the Department of Labor and Department of
Commerce are eagerly awaited by business enterprisers seeking to
keep up to date in their methods, labor policies, coverage of
potential markets, and knowledge of desirable sources of raw
materials. True, their zeal in filling out the questionnaires that
constitute the sources of the desired statistics sometimes falters,
but on the whole businessmen recognize the truly cooperative
character of the system of collecting and disseminating business
statistics and stoically endure the barrage.

As a consequence of the manner of their historical and genetic
origin, therefore, modern statistical sources in the United States 
fit into a pattern that is more or less uniform among the various 
fields of knowledge. This pattern is roughly as follows: 

Research of private enterprisers:

Individual enterprisers. Special monographs, articles, and other
contributions are made by individuals and published under the
sponsorship of universities, professional publications, and the like.

Research associations. Quantities of statistical data are collected by
research organizations, some of which are hired by corporate or
noncorporate private enterprise in the business world, some
connected with universities, and some independently endowed.

Commercial sources, i.e., privately financed publications:

These sources are in the business of collecting and publishing statistics
as a profit-making enterprise; they include

Trade journals. 

Commercial and financial periodicals and services. 

Official publications of the government: 

Federal or national governmental agencies. 

Local, i.e., state or municipal, governmental agencies. 

Guides to Sources of Statistics. If a trained professional 
librarian is available for consultation, he is the best informant on 
the subject of guides or handbooks to all general fields of research. 
However extensive may be the experience and training of the 
research scholar, he finds himself continually relying upon the 
local librarian, who makes a specialty of keeping posted on new 
developments with respect to handbooks and literary guides of all 
kinds. 

Guides to Nongovernmental Statistics. Practically every con- 
ceivable occurrence in the world of man or beast, in the heavens, 
on the ground, under the ground, on the sea, under the sea, or in 
astronomical space holds an interest for some individual or group 
of individuals; either as a hobby or as a means of livelihood some 
individual or group of individuals is now and has for many years 
been collecting statistical facts about all these world events. The 
existing sources of statistics necessarily therefore appear at first 
glance to be an unwieldy mass; but, fortunately both for begin- 
ners and for practiced scholars, this mass has been for some time 
culled over and classified, indexed and cross-indexed by various
types of handbooks, yearbooks, or guides of one sort or another. 

The general magazine indexes constitute one class of such 
guides; the principal ones are as follows: 

Agricultural Index. 

Education Index. 

Engineering Index. 

Industrial Arts Index. 

Public Affairs Information Service.

Readers' Guide to Periodical Literature.

Such indexes or guides are compiled monthly and cumulated
into annual volumes, and articles of a statistical character
appearing in a comprehensive variety of journals and trade magazines
can be discovered by the intelligent use of these alphabetically
arranged indexes. The above-listed indexes are not specifically
organized as guides to statistical sources; their collective purpose
is as broad in scope as all modern knowledge, but one of their
varied uses is to serve as guides to sources of statistics.

Indexes or handbooks specifically dedicated to serve as guides 
to sources of statistics do, however, exist in considerable number. 
In 1937 the United States Department of Commerce published 
“Sources of Current Trade Statistics” (Market Research Series
13), which lists practically all current trade statistics by
governmental and nongovernmental agencies; this handbook was
designed for the use of manufacturers, distributors, financial
institutions, advertising agencies, trade associations, bureaus of
business research, and individuals engaged in research work. In
1942 the United States Department of Commerce published a
handbook entitled Trade and Professional Associations of the
United States, which lists the sources of practically every
conceivable type of trade statistics compiled by nongovernmental agencies.

In 1934 a scholarly attempt was made by Gerlof Verwey and 
D. C. Renooy to construct a manual of statistical sources under
the title The Economist's Handbook; this book was published in
Amsterdam, Holland, and a supplement appeared in 1937. It is
a guide to statistical sources on economic subjects, covering
Belgium, France, Germany, the Netherlands, Switzerland, the
United Kingdom, and the United States. In the United States,
D. H. Davenport and F. V. Scott were authors in 1937 of An Index
to Business Indexes, a book containing information about the
many indexes used in business, including the name of the compiler, 
description of the index, frequency of publication, period covered, 
and the name of the publication in which current data appear. 
In 1937 the Special Libraries Association published a handbook 
Guides to Business Facts and Figures in which Part III is an index 
of statistical sources of information. 

A multiple assortment of handbooks in various special fields 
serve as guides to statistics in each special field of knowledge, 
along with other purposes for which the handbooks are issued. 
For example, Management Handbook, Flitcraft, and Handbook
of Accountants serve, in their respective fields, as guides to
statistical sources.

Often the purpose of a handbook or index of sources of
statistics is served by one of the numerous abstracts of statistical
data. The Statistical Abstract of the United States, published
annually by the United States Department of Commerce, is 
itself a source of statistics, but it is also an index to sources 
because at the head of or in a footnote to each table of data it 
records the primary source from which the data are obtained. 
Similarly, the World Almanac, which for 58 years has been pub- 
lished by the New York World or the New York World-Telegram, is 
itself a source of statistics and also a guide to sources for the same 
reason. 

Guides to Governmental Statistics. Many of the handbooks
serving as guides to statistical sources compiled by
nongovernmental agencies include also in their alphabetical indexes a large
range of governmental statistical sources; but there are a
number of important handbooks specifically intended to serve as
guides to the maze of governmental sources of statistics. The
best-known and most comprehensive guide is the United States
Government Manual, published by the government. In 1938 the
Central Statistical Board (later the Division of Statistical
Standards of the Bureau of the Budget) published a Directory of Federal
Statistical Agencies. The Central Statistical Board was organized
in 1933 in order to find some means of coordinating the various
types of Federal statistics.¹ The business of the Central
Statistical Board was to serve as an agency for the reorganization in
collection, tabulation, and use of Federal statistics. It was hoped
such an agency could help solve the problem of overlapping in
statistical function, which caused unnecessary burdens upon
respondents to questionnaires and which also resulted in
inefficiency in the utilization of statistical information.

¹ In the task of perfecting Federal statistics the government has received
the advice of scientific professional associations. See American Statistical
Association and the Social Science Research Council, Government Statistics:
A Report of the Committee on Government Statistics and Information Services
(1937).

In response to a request by the President in a letter of May 16,
1938, the Central Statistical Board made a report on the question
as to whether or not it is possible to reduce the amount of
duplication in statistical reports. The board concluded that much
could be done in the way of coordinating the gathering,
tabulation, and presentation of Federal statistics; by such coordination,
comparability in definition would bring about a great
improvement in the efficiency of data collected. With reference to the
reduction in the amount of duplication, however, the board
concluded that a majority of the financial and other statistical
reports and returns made by the public to the Federal
government are incidental to the administration of governmental
functions; the statistics are a by-product of either administrative
or control functions of the government. Consequently, the
board recommended that the Federal statistical and reporting
services should remain largely decentralized so that they may be
associated with the respective governmental functions to which
most of them specifically relate; but that there is a continuing
need for a statistical coordinating agency, with a specially
trained staff and with broad powers.¹ One important result
of the coordinating functions of the Central Statistical Board
was the publication of a directory of federal statistical agencies,
which has already been mentioned.

¹ Report of the Central Statistical Board, 76th Congress, 1st Session,
House Document 27, Jan. 10, 1939.

A general guide to government publications, Anne Morris
Boyd's United States Government Publications (1941), serves
incidentally as a guide to governmental sources of statistics.
This book also gives an analytical picture of the character and
scope of government publications. The same may be said
regarding Laurence F. Schmeckebier's Government Publications
and Their Use (1939).

RESEARCH OF PRIVATE ENTERPRISERS

Individual Enterprise. Pioneers. In spite of the fact that
Domesday Book was an eleventh-century product and that even
earlier examples of governmental collection of statistics can be
cited, it remains true both historically and currently that the
pioneer work of converting public records into statistics is
nongovernmental. The pioneers have been and are individuals.
The father of modern vital statistics is John Graunt, who in the
seventeenth century made statistical investigations that served
as the basis for founding life insurance. Another
seventeenth-century scholar, Sir William Petty, was the outstanding pioneer
in developing statistics for the social sciences. Both these
men were associated with the early development of the Royal
Society of London, which was incorporated in 1662 and is the
oldest of modern learned societies.

Pioneering in the art as well as the science of statistics
continues in modern times to be highly individualistic. This is
exemplified by the work of Karl Pearson in England¹ and in the
United States by such men as Wesley C. Mitchell and his works
on index numbers and the business cycle, Warren Persons and
his work on the statistical analysis of business statistics, and
many others.² Individual contributions are commonly presented
in the publications of learned societies, such as the Journal of the
Royal Statistical Society, the Journal of the Statistical Society of
London (founded in 1834), and the Journal of the American
Statistical Association (founded in 1839). These and the publications
of other learned societies are indexed in the guides mentioned
earlier in this chapter.

¹ See Chaps. XIII and XIV.

² See Chaps. XIX and XX.

Research Associations. During the 1920's and 1930's a
number of important research organizations in the field of economics
and social institutions were organized. The Brookings
Institution in Washington, D.C., the Harvard Committee on Economic
Research, the National Industrial Conference Board, the National
Bureau of Economic Research, and the Cowles Commission
were among the most prominent.

The Harvard Committee on Economic Research was organized
in 1919 to study business trends and cycles and to publish a
scientific business forecaster; its work was launched under the
leadership of Warren Persons. In addition to the forecasting
service, this research organization publishes the Review of
Economic Statistics (quarterly) and once or twice a year a summary
of statistics called the Statistical Record.

The National Industrial Conference Board was organized by a
group of comparatively public-spirited manufacturers to study
the various problems of employer-employee relationships, leading
them into special studies of real wages, income distribution, and
general economic conditions. It publishes its studies in the
form of books appearing as they are written. In addition to
the subjects mentioned above there have been National
Industrial Conference Board books on cost of living, statistics of income
by states, and availability of bank credit. The National
Industrial Conference Board has also published since 1940 The
Economic Almanac, which is a widely used annual.

The National Bureau of Economic Research was founded in
1920, sponsored by a group who believed that a purely
disinterested approach is desirable and that no group should control
the findings of this new statistical organization. It is so
constituted as to produce this desirable result. A number of special
studies of economic and social conditions have been made and 
published under its auspices and some in cooperation with the 
government. For several years it has occasionally issued 
bulletins containing data resulting from studies that usually 
appear later in more detail in book form. 

The nature and accomplishments of the National Bureau of
Economic Research are indicated by the following quotation from
the twentieth annual report of the director of research:¹

¹ Mitchell, Wesley C., “The National Bureau's Social Function,”
March, 1940, pp. 13-15, 19.

The National Bureau was established by men who believed that it is
becoming possible to apply quantitative methods to the study of
economic behavior. They realized that this field is far more difficult than
the fields in which science has won its major triumphs and demonstrated
its practical usefulness most conclusively. Also they recognized that
investigators cannot experiment at will upon society; though society
can and does experiment loosely upon itself. . . . Economics was not
likely to grow faster at this turning point in its career than its elder
sisters [the natural sciences]. But at the close of the First World War
the materials for observing actual behavior were multiplying so rapidly
and analytic methods of extracting significant conclusions were becoming
so versatile and powerful that our founders thought their staff had good
prospects of rendering valuable service at once. Also they hoped that
one modest success would lead to others, fostering cumulative growth
of the kind that has characterized systematic research in other fields. . . .

Twenty years of effort along the lines laid down in 1920 have
confirmed our faith in the social value of what the National Bureau set out
to do. Our accomplishments have not been spectacular, but they have
been substantial, and they afford a secure foundation on which to build
in future. We have more reason than ever to believe that in trying to
establish a few economic fundamentals firmly we are aiding thoughtful
men of all persuasions to plan wisely. If tested knowledge is the safest
and surest guide in practical affairs, our work has social meaning,
however technical its character. . . . We hold that advance will be rapid
and continuous in proportion as the workings of our economic system
are understood. In trying to replace speculative opinions about
economic relations by conclusions resting upon evidence we are expediting
progress in the most effective manner we know.

. . . Another device, peculiar to the National Bureau, is to select 
directors who have divergent views on public policy and give each an 
opportunity to criticize every manuscript. That device has been of 
inestimable help to us in keeping our reports nonpartisan and therefore 
worthy of credence by the public. Having such a board we cannot 
expect unanimous consent from its members to many policies that 
individuals among us favor. But the mere fact that the National 
Bureau never takes sides upon controversial issues adds its bit of pro- 
tection against bias in our publications and helps toward meriting and 
winning public confidence. 


The more thoughtful sections of the public we are now reaching in 
various ways. Physical scientists are coming to recognize the con- 
tributions of research in economics; for example, in I Believe Robert A. 
Millikan says 

“In economics and the social sciences long and elaborate statistical 
studies must be made in order to eliminate the disturbing factors and 
thus obtain the controlled conditions. We are just beginning to have 
available, through the National Bureau of Economic Research and other
similar agencies, a large amount of such definite, dependable, statistical
knowledge in economics.”

The Twentieth Century Fund is another research association 
organized to function in a manner similar to that of the National 
Bureau of Economic Research. It publishes occasional pam- 
phlets or books. 

THE COMMERCIAL SOURCES 

In addition to the numerous sources of statistics resulting from 
individual or group research such as those described above, a 
great quantity of statistical sources has come into existence as 
the result of the activities of those who go into the business for 
the profit of collecting and selling statistical data. Such are the 
trade journals and the commercial and financial periodicals. 

Trade Journals. A large number of trade journals are actively 
engaged in collecting statistical data for various types of
enterprise. The Iron Age, for example, founded in 1855, is the trade
journal for the iron and steel industry, publishing statistics on
iron and steel production in all states and the prices of iron, steel,
copper, zinc, etc. Another example is Wileman's Brazilian
Review, which is the trade journal for coffee. The trade journals
are frequently used by governmental statistical organizations,
such as the Bureau of Labor Statistics, the Department of
Commerce, and the Board of Governors of the Federal Reserve
System, as the primary sources of particular data assembled
by them. Occasionally the trade journals will publish in special
pamphlet form or in books assembled data of the trade.

During the 1920's, a large expansion in the collection and
publication of statistics in various lines of economic activity on 
physical commodity production and distribution took place. 
In a few instances this work was done by private companies. 
Thus Seidman and Seidman compiled data on furniture for the 
Grand Rapids district, and R. L. Polk and Company compiled 
data on new cars registered; the function of the latter was
subsequently taken over by Ward's Automotive Reports. Many
such series were compiled by the trade journals from public 
records. The Iron Age compiled data on physical quantities 
of production of pig iron, and the Statistical Sugar Trade Journal 
published quantitative sugar statistics.

Trade Associations. Most of the production and distribution 
series are compiled by the various trade associations, such
as the American Face Brick Association (merged with the
Structural Clay Products Institute), the American Paper and
Pulp Association, and the United States Cane Sugar Refiners'
Association.

The production and distribution statistical series are of various 
types. Some measure the flow of commodities through the
process of production and distribution, for example, data on 
raw material received or consumed, like the figures on cotton 
consumption by textile mills or on cattle receipts at stockyards. 
Others give a measurement of quantity or stock of a commodity 
on hand. Still others are figures on the amount of orders or 
sales of the product, such as the unfilled orders of the United 
States Steel Corporation. As noted elsewhere, many of these
series are collected from their original sources and published
by the United States Department of Commerce in the Survey
of Current Business. Consequently, the appendix of the Survey
contains a description of about every important commercial
source of statistics. In fact, the Department of Commerce
publishes a description of such statistical sources.¹ Frequently
a trade association will publish a sort of handbook or abstract
of statistics for the trade, covering historical as well as current
statistics.²

Commercial and Financial Publications. The commercial 
and financial journals and services are also too numerous to 
mention in detail, but a few may be described as typical. Among 
these are the Commercial and Financial Chronicle (weekly), the 
New York Journal of Commerce (daily), the Wall Street Journal³
(daily), Bradstreet's (merged in 1933 with Dun's), Babson's
Reports, Moody's Investors' Service, Standard & Poor's
Corporation, Brookmire Economic Service, and the Dodge Statistical
Service.

¹ “Sources of Current Trade Statistics,” Market Research Series 13 (1937).

² United States Cane Sugar Refiners' Association, Sugar Economics,
Statistics, and Documents (1938).

³ Often referred to in footnotes as Dow, Jones and Company, which
sometimes mystifies beginners.

While there is much overlapping of published commercial
and financial statistics through these various publications and
services, nevertheless each has become noted for especially good
statistical service in a particular line. For example, the user of
business-failure statistics thinks first of Bradstreet's, because for
many years the data that it has published on business failures
have been widely used. Bradstreet's was also famous for its
index of wholesale prices for the United States, being a pioneer
in the development and publication of such an index. Babson's
and Brookmire's services are noted for business forecasting and
for investment services and forecasting the stock market. The
New York Journal of Commerce is noted for its current data on
new securities issued and on the produce markets. The New
York Times is noted for its index of business activity, which
was published in the Annalist (weekly) until that periodical
was discontinued. The Commercial and Financial Chronicle is
particularly useful for its detailed array of current data on bank
clearings, business failures, interest rates, stock and bond prices,
corporations, capital stock and bond issues, and the money
markets of the world. This remarkable publication can be
traced in its lineage back to 1820, when it started as Niles'
Weekly Register, famous as an early preacher of the doctrines of
high tariffs and the “American system.” From 1839 to 1865 it
was called Hunt's Merchant's Magazine. Since 1865 it has
gone under its present name. The financial statements of all
kinds of corporations, together with other statistics and corporate
histories, are to be found in Moody's Manual of Corporations.

The Commodity Yearbook is published by the Commodity
Research Bureau, New York, N.Y. This is a private organization
devoted to the dissemination of accurate information on
commodities and other related subjects, including production,
consumption, prices, stocks, imports, exports, etc. Some are
annual, some are monthly data.

All the above-described sources are extensively used by
American and foreign business enterprisers, whose subscriptions
to them and advertising in them make possible the vast statistical
undertakings on a profitable basis. The fact that they are so
supported would seem to prove the value of statistics to modern
business enterprise.

OFFICIAL PUBLICATIONS OF THE GOVERNMENT 

Federal Statistical Agencies. Department of Commerce. The 
Department of Commerce is one of the greatest fact-gathering 
organizations of the Federal government, if not the greatest.
It contains a number of bureaus chiefly engaged in the dissemina- 
tion of facts concerning not only commerce but economic and 
social life in general. The Bureau of the Census is the fact- 
gathering agency of the Department. 

The Articles of Confederation provided for the taking of a
triennial census, but the Constitution of the United States
provides for the taking of a population census every 10 years,
to serve as the basis for Congressional apportionment. The
first one was taken in 1790. The broad practical and scientific
purposes that the census today serves were not in the minds of
the American founders, and the earlier census publications were
meager affairs compared with the modern census.¹ The census
of 1790, for example, returned the number of free white males
over, and the number under, sixteen years of age, the number of
free white females without distinction by age, all other free
persons, and slaves, the last two classes without
distinction by either sex or age. The published census of 1790
consisted of a volume of 52 pages. At the census of 1800 and of
1810, five age classes were distinguished and the age classification
was extended to white females. In addition, at the census of
1810, some facts were compiled relating to manufacturing
establishments, their number, nature, extent, situation, and value.
A digest of the results of these data was prepared by Tench
Coxe and published in 233 pages. The census of 1820 introduced
the idea of collecting occupational statistics, calling for
enumerations of persons engaged in agriculture, commerce, and
manufactures. The census of 1830 returned to the original idea of
obtaining merely a population enumeration; but in 1838
President Van Buren suggested to Congress in his annual message
that the census should be extended so as to include “authentic
statistical returns of the great interests specially entrusted to or
necessarily affected by the legislation of Congress.”² As a
result, Congress provided in the act for the Sixth Census (1840)
that the marshals should return “in statistical tables ... all
such information in relation to mines, agriculture, commerce,
manufactures, and schools, as will exhibit a full view of the
pursuits, industry, education, and resources of the country.”³
Congress overreached the capacity of those entrusted with the
task of census taking, for the census of 1840 is famous for its
inaccuracies. At the census of 1850, improvements in the
organization of collecting and compiling the statistics were
made; and, according to Cummings, with the census of 1850
the decennial enumeration began to assume modern proportions
and character.

¹ Cummings, John, “Statistical Work of the Federal Government of the
United States,” in Koren, History of Statistics, pp. 670-672.

² Ibid., p. 672.

³ Ibid., pp. 672-675.

One of the outstanding American economists of the nineteenth
century, Francis A. Walker, was a pioneer in developing the
census to what we understand it to be now. He did particularly
notable work in perfecting the organization and presentation
of statistical data in the Tenth Census (1880), of which he had
charge.

At the Eleventh Census (1890), machine tabulation was
introduced (the Hollerith tabulating machines), at a great
saving of time and expense. The printed reports of the census
of 1890 aggregated 21,410 pages, in 25 quarto volumes, the final
report being issued in 1897. The Bureau of the Census was
established as a permanent one in 1902 and since that time has
been in continuous operation as a great fact-gathering
organization for the national government. The tendency since that
time has been to confine the decennial census to the major
subjects of population, manufacturing, agriculture, mines, and
quarries, and in intervening years to take censuses of business.
In intercensus years the Bureau also has charge of the annual
collection of mortality data, statistics on religious bodies, the
collection and compilation of statistics of cotton and tobacco,
and the annual compilation of statistics of cities of 30,000
population and over, and financial statistics of states.¹

After 1902 the census of manufactures has been taken every 
5 years until 1919 and since 1919 every 2 years until 1939. The 
census of agriculture has been taken every 5 years since 1910. 
The Statistical Atlas (containing graphic illustrations of much of 
the census data) was first issued in 1874 [based on the Ninth 
Census (1870)] and has appeared irregularly since that date. 

In 1929 a census of distribution as well as of manufactures 
was taken; but when the National Recovery Administration 
began operations, many of the data assembled in the census 
year 1930 were out of date owing to the sharp business recession 
and the increase of unemployment following that year. Along 
with the regular biennial census of manufactures for 1933 the 
Bureau of the Ckmsus undertook an extensive census of business 
of types other fhan manufacturing, such as amusements, service 
businesses, barbershops, beauty parlors, repair shops, and tourist 
camps, covering more than 2,400,000 individual establishments. 

^ By order of th(? Secretar}’^ of Commerce, the collecting of financial sta- 
tistics of states was discontinued temporarily after the 1931 report. With 
no comparative basis provided by the statistics for smaller cities and no 
individual reports for states, the remaining reports were of greatly reduced 
value. A detailed analysis was therefore made of the needs for data in this 
field and of the Bureau’s past and present inquiries. Closely related reports 
were prepared for the director by the Central Statistical Board, the Advisory 
Committee to the Director of the CVnsus, and the Municipal Finance 
Officers’ Association of the United States and Canada. Accordingly, the 
Division of Financial Statistics of States and Cities was reorganized in 1936. 
Annual Report of the United States Bureau of the Census, 1937, pp. 23-24; 
1938, pp. 28-29. 

For subsequent biennial dates the census of business was 
further developed. The census of business covering the calendar 
year 1935, for example, was much broader in scope than either 
the census of distribution of 1929 or the census of American 
business for 1933. The 1935 census of business attempted to 
obtain a reasonably complete picture of essential and com- 
parable items of business information concerning practically all 
lines of business activity in the United States. It comprised 
a complete census of retail and wholesale trade, service businesses, 
amusement enterprises, hotels, broadcasting stations, advertising 
agencies, banking, insurance, real estate, bus transportation, 
trucking, warehousing, construction, and distribution of 
manufacturers' sales through primary channels. 

Elaborate care was exercised in preparing the 17 schedules; 
before final use they were submitted for criticism to representa- 
tives of the business groups and governmental agencies prin- 
cipally concerned. Special efforts were made by the Bureau to 
integrate the census of business and the biennial census of 
manufactures by the adoption of common definitions, instruc- 
tions, area designations, and field procedures. In order to 
perfect procedure, conferences were held to discuss schedules, 
procedures, and other problems inherent in such an expanded 
business census. These conferences were attended by 
representatives of trade associations, professional groups, chain-store 
organizations, etc., and by official representatives of a number 
of governmental agencies — the Central Statistical Board, 
Interstate Commerce Commission, Bureau of Foreign and 
Domestic Commerce, Tariff Commission, Federal Reserve 
Board, and Bureau of Labor Statistics.^ 

The population schedule for the census of 1940 is notable 
for a number of new questions concerning employment status, 
migration, income status, housing, and education. It is also 
notable for the innovation of the sampling technique applied 
to one group of questions in order to widen the scope of the 
inquiries. It dropped the question on literacy. 

^ Annual Report of the United States Bureau of the Census, 1936, pp. 
19-21. 

Employment and unemployment queries have been made in 
previous censuses, but the 1940 census made a new approach. 
The new data permit classification of the nation's labor force 
into the employed, the unemployed who have had previous work 
experience, and the unemployed without previous work 
experience, that is, new workers. They provide some measure of the volume 
of employment both during the whole year and during the week 
prior to the census day, Apr. 1, 1940. 

The schedule included questions that distinguish people at 
work, people unemployed who are seeking work, and people who 
have a job but are not at work because of temporary illness, 
industrial disputes, or vacations. Persons at work were asked to 
indicate the number of hours they worked during the week 
preceding the census, and the unemployed were asked to state 
the number of weeks they had been seeking work. Workers 
were classified as to whether they were in private industries or 
were employed by the government and whether they were 
own-account workers or unpaid family workers. 

The new inquiry on wages and salaries is important as a 
measure of national purchasing power and its distribution, and 
the resulting data have been helpful to business in indicating 
potential market areas. 

The net effects of internal population migration during the 
preceding 5 years were obtained by requesting the place of 
residence for each person as of Apr. 1, 1935. It is expected that 
compilation of the statistics comparing such residence with that 
of Apr. 1, 1940, which is also recorded on the schedule, will 
measure the effects of industry shifts, droughts, depressions, 
floods, the backflow west to east, and the shift from the city to 
the country, or vice versa. 

In 1940, for the first time, the decennial census included a 
separate housing schedule designed to give detailed information 
for each dwelling unit in the United States, whether occupied or 
vacant, rural or urban. Data were obtained as to the number of 
rooms, water supply, bath and toilet facilities, and light equip- 
ment. For each occupied unit or household, information was 
obtained concerning the principal means of refrigeration used, 
the presence or absence of a radio, the character of the heating 
equipment, and the principal heating and cooking fuels used. 
Each residential structure was described in respect to single, 
double, or multiple family occupancy, whether or not it contained 
a business unit, for what purpose and in what year it was orig- 
inally built, the principal exterior material of the structure, and 
whether it was in need of major repairs. The schedule included 
a question on whether the family leases or owns, whether there 
is mortgage indebtedness, and methods of home finance. 

It is expected that the compilation of these data will provide 
valuable information on the latent purchasing power of a com- 
munity. There is no more important index of the social and 
economic status of a population than the standard of its housing. 
Housing experts believe that the information gathered will be of 
inestimable value in determining future housing policies. It 
will be of especial interest to manufacturers, builders, distributors, 
and bankers in their study of trends in home ownership and 
building in the United States. Cities will be able to determine 
the distribution of the various types of housing within their 
limits, together with the possible need of expansion of transporta- 
tion and communication systems, police and fire protection, 
schools, and similar facilities. Data showing the equipment in 
houses, together with the state of repair of the homes, will be of 
value to manufacturers and distributors of housing products 
in the planning of their sales campaigns.^ 

The agricultural schedules for the census of 1940 likewise had 
a number of new features. Nine regional schedules, each used 
in a separate group of states, were especially designed to fit 
national variations in cropping practices. Questions designed 
to obtain subtotals for the value of various major categories of 
farm products sold or traded in 1939 made possible a much 
closer estimate of total farm income and of farm income by 
principal sources. The 1940 census also introduced a 
supplementary plantation schedule for use in the cotton belt that made 
possible a refined distinction between farms and plots cultivated 
by croppers and defined the exact status of each cropper and 
certain other tenants in relation to the plantation owner. Ques- 
tions to measure the effects of current agricultural policies were 
also asked, relating to soil-improvement crops, summer fallow, 
crop failure, and succession or interplanted double cropping. 

^ Cf. The New York Times, Jan. 24, 1940. 

The Bureau of Foreign and Domestic Commerce is the great 
Federal fact analyzer and fact publisher in the Department of 
Commerce. It has a curious and rather complicated history. 
From the beginning of the national period, the statistics of 
foreign commerce were linked up with our tariff policy and 
maintained by the Treasury Department. In 1856, growing out of 
an investigation of the tariff policies of other countries by the 
State Department, there was created a Bureau of Foreign Com- 
merce as a permanent bureau for the purpose of collecting 
statistics on foreign trade. In 1866, the Bureau of Statistics 
of the Treasury Department was created to take special charge 
of this work, and at the same time Congress gave it power to 
collect statistics on domestic trade as well as on foreign trade. 
In 1905, a Bureau of Manufactures in the Department of 
Commerce was organized to foster, promote, and develop the various 
manufacturing industries of the United States, and markets for 
the same at home and abroad, by gathering and publishing all 
available and useful information concerning industries and 
markets. 

As a consequence, there were bureaus in three separate depart- 
ments (Treasury, State, and Commerce) concerned with the 
gathering of foreign-trade statistics. In 1912, however, these 
functions were centralized in the Bureau of Foreign and Domestic 
Commerce of the Department of Commerce. 

The most important statistical publications of this bureau 
are the monthly Survey of Current Business (with a weekly sup- 
plement) and the annual Statistical Abstract of the United States. 
Special publications designed to aid business are also prepared, 
for example, historical studies of industries, studies of the national 
income produced, and studies of market data.^ 

Other bureaus of the Department of Commerce are the 
Bureaus of Fisheries, of Patents, and of Navigation and Steam- 
boat Inspection, each of which publishes specialized statistics. 
The two great statistical organizations in the Department of 
Commerce, however, are the Bureau of the Census and the 
Bureau of Foreign and Domestic Commerce. 

^ Illustrations are P. W. Barker, Rubber Industry of the United States, 
1839-1939 (1939); Division of Economic Research, National Income in the 
United States, 1929-35 (1936); B. P. Haynes and G. R. Smith, Consumer 
Market Data Handbook (1939). For other statistical publications of the 
Bureau of Foreign and Domestic Commerce, see the United States Government 
Manual. 

Department of Labor. The United States Department of Labor 
also contains bureaus that publish statistics, the most important 
from the point of view of quantity of data compiled and 
published being the Bureau of Labor Statistics. This was created 
in 1884 as the Bureau of Labor, although the Treasury Bureau 
of Statistics created in 1866 had been enjoined to collect wage 
statistics. In 1888 the Bureau of Labor was made an inde- 
pendent Department of Labor. The duties of the Department 
of Labor were to acquire and diffuse among the people of the 
United States useful information on subjects connected with 
labor, in the most general and comprehensive sense, and espe- 
cially on its relation to capital, the hours of labor, the earnings 
of laboring men and women, and the means of promoting their 
material, social, intellectual, and moral prosperity. The com- 
missioner of labor in charge of the Department was specially 
charged to investigate the causes of and facts relating to all 
controversies and disputes between employers and employees, 
and he was also empowered to make special studies of articles 
controlled by trusts and their effect on production and prices 
and other special subjects. Owing to the excellent work of the 
Department under the wise guidance of Carroll D. Wright, the 
first commissioner of labor, there is available a large mass of 
statistics in the field of labor for this country, including studies 
of strikes, the effect of the introduction of machinery on employ- 
ment and wages, the conditions of living and work of the laboring 
population, etc. Upon the basis of the wage and price data 
collected, index figures showing the trends of wages and prices, 
wholesale and retail, have been constructed and published by 
this bureau. 

In 1903, the old Department of Labor was transferred to the 
newly created Department of Commerce and Labor; but in 1913 
there was created a new Department of Labor, and in that 
department the Bureau of Labor Statistics. At the present 
time, the principal publications of the Bureau of Labor Statistics 
are the Monthly Labor Review (published since 1915), bulletins 
on special topics such as wholesale prices, retail prices, cost of 
living, wages, and labor turnover, and monthly serials to supple- 
ment the bulletins and give current information on those topics. 
Beginning in August, 1939, the Bureau of Labor Statistics pub- 
lished a daily index of 28 basic commodity prices at wholesale; 
but following the inauguration of wartime price controls this 
index was published only once a week since control in the 
raw-material field was widely effective. During wartime the index 
was of little importance.^ 

Treasury Department. For the period before the Civil War 
the chief source of financial and price statistics in the United 
States, as well as data on governmental finance, consists in the 
finance reports of the Secretary of the Treasury. 

Before the development of statistical bureaus in the 
Department of Commerce and the Department of Labor, the Treasury 
Department was the most important source of Federal statistics; 
and it is still important in the fields of banking and monetary 
statistics, owing to the work of the comptroller of the currency, 
and in the field of income and Federal taxation and indebtedness, 
owing to the work of the commissioner of internal revenue and 
the Secretary of the Treasury. 

From the United States Treasury Department comes the 
monthly Statement of the Public Debt of the United States. 
The commissioner of internal revenue of the Treasury publishes 
an annual report of income-tax returns, constituting the most 
important source of data regarding income statistics in the 
United States. The annual reports of the comptroller of the 
currency give financial and banking statistics and monetary 
data going back as far as the Civil War, when the national bank- 
ing system began. The comptroller publishes these data in 
an annual report and also several times a year in the Abstract of 
Condition of the National Banks.^ The annual reports of the 
director of the mint contain statistics on the production of the 
precious metals, including gold and silver. The Life Saving 
Service of the United States Treasury Department publishes 
data on marine accidents. 

^ For other statistics published by the Department of Labor see the United 
States Government Manual and see also Bureau of Labor Statistics, Selected 
List of Publications of the Bureau of Labor Statistics (1939), which can be 
purchased from the Government Printing Office. 

* See page 82 on the Federal Reserve System. 

Interior Department. The Department of the Interior has 
important statistical aspects, too. The Bureau of Mines 
publishes data on fatalities in coal mines. The Geological Survey 
publishes data on metal statistics and minerals. In the census 
years it has authority to collect statistics from primary sources. 
Since 1880 it has collected statistics carefully as to the crude 
oil lifted from the ground, iron ore, etc., watching the physical 
consumption of our natural wealth. It also collects and 
publishes statistics on electrical power production which are now 
considered useful in the study of the general trend of business, 
so important to business is the use of electricity. Other bureaus 
in the Department of the Interior are the Bureau of Education, 
Bureau of Pensions, and the Bureau of Indian Affairs, each 
publishing certain specialized statistics indicated by their titles. 

Department of Agriculture. The Department of Agriculture 
was not founded until 1862, but statistical work relating to agri- 
culture of a more or less systematic nature dates back to 1839, 
when Congress appropriated $1,000 out of the patent fund, to be 
expended under direction of the commissioner of patents, “in 
the collection of agricultural statistics, and for other agricultural 
purposes.” At the present time the great bulk of Federal 
statistics on agricultural matters is collected and published by 
the Bureau of Agricultural Economics, which originally was the 
Bureau of Statistics in the Department of Agriculture and later 
was known as the Bureau of Markets and Crop Estimates. In 
addition to a host of bulletins on special subjects related to 
agriculture, this bureau publishes a monthly report on weather 
conditions, Crops and Markets, and gives out estimates of annual 
crop yields. In recent years it has become the source of pioneer 
statistical work in the measurement of the factors influencing the 
demand for agricultural products and other similar statistical 
studies in connection with the conduct of the Agricultural Adjust- 
ment Administration. The agricultural yearbook, published by 
this Department, is a valuable record of agricultural progress in 
the United States and contains also extensive summaries of 
agricultural statistics. Since 1936 these summaries have been 
published separately under the title Agricultural Statistics. 
Current agricultural data are disseminated by the Department 
of Agriculture in its monthly publication, the Agricultural 
Situation. The Bureau of Agricultural Economics, which has 
direct charge of the above publications, also furnishes part of 
the program for the Farm and Home Hour on the radio, designed 
to distribute timely agricultural information to the farming 
population of the nation. 

The administrative departments of the government thus con- 
stitute sources of statistics on a large scale, and statisticians 
continually make use of these Federal sources of statistics. 
These publications of the government are available to everyone at 
very low cost and can be found for free use in most large libraries 
of the country or at offices maintained for the purpose by the 
government. 

The Independent Establishments. In addition to the 
administrative departments of the national government there are many 
national commissions or boards or agencies, collectively described 
as the “independent establishments” of the government. Some 
of these have become well-known sources of statistical data in 
special fields. The principal ones are the Interstate Commerce 
Commission, the Federal Trade Commission, the Federal Security 
Agency, the Federal Power Commission, the Federal Deposit 
Insurance Corporation, the Securities and Exchange Commission, 
the Tariff Commission, the Maritime Commission, and the Board 
of Governors of the Federal Reserve System. 

The Interstate Commerce Commission was created in 1887 
as the Federal government’s solution of the railroad problem, 
following detailed Congressional reports of the situation, known 
as the Windom Report (1873-1874) and the Cullom Report 
(1886). These reports may be said to be the beginning of Federal 
railroad transportation and communication statistics. Since 
1887, such statistics have been gathered and published by the 
Interstate Commerce Commission, its powers having been gradu- 
ally extended to include other types of transportation, oil pipe 
lines, and express companies. In 1934 Congress created the 
Federal Communications Commission, which is devoted primarily 
to telephone, telegraph, cable, and radio. 

The Federal Trade Commission is the Federal source of data on 
the monopoly problem. In 1890 the Sherman Antitrust Act 
was passed; and in 1903 Congress realized that there was need to 
collect facts to be used as a basis for the enforcement of the 
Sherman Act. At the urgent request of President Roosevelt, 
Congress created the Bureau of Corporations for the purpose of 
gathering data that would aid in the proper enforcement. Fol- 
lowing the passage of the Federal Trade Commission Act of 1914, 
the Bureau of Corporations was merged with the Commission. 
This Commission publishes reports on its investigations of various 
trusts, such as the investigation of coal, cotton, cereals, meat 
packing, and a number of others. During the 1920’s and 1930’s 
it was a collector and publisher of statistics concerning trade 
associations and trade practices. 

The Board of Governors of the Federal Reserve System, which 
has operated since 1913, has become the greatest national source 
of statistics on banking and financial subjects. It publishes an 
annual report containing statistics on banking and related sub- 
jects, the Member Bank Call Report several times a year, and the 
Federal Reserve Bulletin, a monthly publication invaluable to 
bankers and statisticians working in banking subjects. In addi- 
tion, it publishes weekly mimeographed press releases on the 
condition of Federal reserve banks and of reporting member banks 
in order to make available more current data than is possible 
with the monthly or annual publications. In addition to financial 
and banking statistics the Board also has constructed through 
its Division of Research and Statistics an index of production 
calculated upon a comprehensive basis; this index and other 
special studies are also published in the annual reports and in the 
Federal Reserve Bulletin. 

The United States Tariff Commission, created in 1916, gathers 
statistics purporting to aid in the administration of the tariff 
laws and to help determine when duties should be raised or 
lowered. Owing to the strong influence of politics upon the 
question of the tariff, the studies of the Tariff Commission, with 
certain notable exceptions, constitute a great source of misuse 
of statistics. This was particularly true for the period from 
1920 to 1932 when most of its studies were for the purpose of 
proving the need to raise tariffs. After the passage of the Reciprocal 
Trade Agreements Act in 1934 extensive improvements were 
inaugurated, and additional data were made available with the 
numerous studies that were conducted in cooperation with the 
State and other governmental departments. 

Finally, in connection with Federal statistics, it should be 
mentioned that frequently Congressional investigations result in 
the assembly and publication of valuable statistical material often 
constituting original sources or at least original compilations of 
such material. Mention has already been made of the Windom 
Report in 1873-1874 and the Cullom Report in 1886, both on 
transportation, which led to the creation of the Interstate Com- 
merce Commission in 1887. Other examples are the Pujo 
Money Trust Report of 1913 and the various reports of the Senate 
and House Committees on Banking and Currency during the 
1930’s on brokers' loans, branch banks, the operation of the 
national and Federal reserve banking systems, foreign loans, and 
stock-exchange practices. Important Federal legislation of that 
decade was based on these investigations. 

Several noteworthy special commissions, created by Congress 
from time to time, have produced published documents that have 
become famous as great sources of primary statistical information. 
The Aldrich Reports from the Senate Committee on Finance, 
on Retail Prices and Wages (1892) and Wholesale Prices, Wages, 
and Transportation (1893) constitute extensive compilations of 
price data covering a period of over fifty years. These reports 
have been extensively used as source material for statistical 
studies of prices and wages for the period 1850 to 1900. 

The Industrial Commission created by act of Congress of 
June 18, 1898, submitted a report to Congress in 1902, consisting 
of 19 volumes and presenting a substantially complete epitome of 
the industrial life of the nation and of the important changes in 
business methods that occurred in the latter part of the nineteenth 
century. These volumes are largely statistical in their methods 
of description. The Immigration Commission, created in 1907, 
presented to Congress in 42 volumes a full inquiry into the sub- 
ject of immigration, reviewing statistically immigration to the 
United States during the period 1820 to 1910 and the com- 
ponent elements in our population as determined by immigration 
from 1850 to 1900. The National Monetary Commission, 
created in 1908, studied the banking and currency systems of 
the United States as compared with those of other countries. 
This Commission collected more complete statistical information 
with regard to the banks of foreign countries such as Great 
Britain, France, and Germany than had ever been collected 
before and for the first time in this country obtained compa- 
rable statistics for all banks in the United States. The full 
report of the Commission, consisting of 24 volumes, was com- 
pleted in 1912 and served as the basis of the bank-reform 
legislation known as the Federal Reserve Act. 

Other similar statistical studies in various fields of economic 
and social life have been made by commissions, such as those of 
the Select Committee on Wages and Prices established in 1910, 
the Commission on Industrial Relations created by an act of 
1912, and the Commission on National Grants to Vocational 
Education. The Hoover Committees on Social Trends (1933) 
published extensive studies, partly statistical in character, of the 
economic and social life of the nation. 

One of the most notable of such temporary organizations was 
the National Resources Planning Board, established in the 
executive office of the President of the United States under 
authority of the Reorganization Act of 1939. This Board 
succeeded the National Resources Committee, which had been 
established in 1935. Earlier names of the same organization 
were National Resources Board and Advisory Committee and 
National Resources Board, which was created in 1934 to succeed 
the planning organization of the Federal Emergency Administra- 
tion of Public Works. When the United States Congress dis- 
covered what it felt was an attempt by the executive to usurp 
Congressional powers by having an economic planning board, 
it became hostile to the National Resources Planning Board. 
This hostility was not diminished when in 1943 the Board pre- 
sented to the executive a plan for the postwar expansion of the 
Federal security program. President Roosevelt handed the 
report over to Congress for action, but the Board was abolished 
in that year when Congress refused to vote funds for its con- 
tinued existence. During the course of its checkered career, 
however, the Board became the author of several noteworthy 
statistical publications: Energy Resources and National Policy 
(1939), The Problems of a Changing Population (1938), Consumer 
Incomes in the United States (1938), Consumer Expenditures 
in the United States (1939), and The Structure of the American 
Economy (1939). 

State and Municipal Sources. The activities of the various 
state governments result also in the compilation and publication 
of statistics. Most states maintain departments of institutions 
and agencies that, through supervision of reform schools, prisons, 
hospitals, and the like, become sources of statistics on mental 
and physical pathology, as well as delinquency. Data concern- 
ing the records of penal and charitable institutions, hospitals, 
and asylums for the insane and feeble-minded are primarily 
recorded by state or by municipal organizations. 

Vital statistics, that is, data relating to births and deaths and 
the classification of deaths by causes, have become an important 
part of the demographic work of municipalities and states and 
have thus made the state and municipal governments important 
primary sources of data of this character. In addition, statistics 
on marriage and divorce are recorded through state and munici- 
pal licensing administration. 

Data are recorded by states and regularly reported, based on 
their tax-collecting, licensing, and registration responsibilities. 
For example, statistical data result from automobile registration 
by states. 

State incorporation laws result in the accumulation of data. 
State incorporated banks and trust companies and building and 
loan associations, for example, are all regulated by the banking 
departments of the various states, and statistics regarding these 
institutions are regularly compiled and published by these 
departments. Similarly, life insurance, fire insurance, 
automobile and casualty insurance, and workmen’s compensation laws 
and social-security laws have resulted in state-regulating bodies 
and the compilation and publication of statistical data on 
financial, commercial, and industrial subjects. 

A number of the larger and older of the industrial states have 
highly efficient labor departments, which compile and publish 
statistics of industrial conditions. Of increasing importance and 
interest to social scientists is the development of the volume of 
statistics relating to industrial accidents and diseases, growing 
out of the need for such statistics in the administration of the 
workmen’s compensation laws. 

The regulation of public utilities and water companies and 
street-railway and bus companies by state and municipal 
authorities has made the public-utility commissions of the states the 
principal primary sources of statistical data on these important 
industries, although in the 1930’s many of these data were 
gathered by the Federal Power Commission and the Securities 
and Exchange Commission. 

WORLD STATISTICS 

Under the League of Nations progress has been made in the 
collection and publication of world statistics. These are pub- 
lished in the Monthly Bulletin of Statistics of the League of 
Nations and also in its International Statistical Yearbook and 
its annual World Economic Survey. Statistics on world 
commercial banking and finance were published in special League 
publications. Previous to the work of the League of Nations 
in this respect, the World Almanac had for many years been 
highly valued as a rough-and-ready source of a variety of world 
statistics and still constitutes a popular source. 

The Statesman's Yearbook, published by Macmillan & Com- 
pany, Ltd., London, is a statistical and historical annual of the 
states of the world, giving data on population, area, finance, 
commerce, and banking, as well as figures on the fleets of the 
world and the world’s shipping. It has been issued annually 
since 1864. The United States government has always shown 
considerable interest in statistics of foreign countries and has 
published them along with the domestic data; but this practice 
has been far more systematic and thorough since the First World 
War. For example, the Federal Reserve Bulletin regularly 
publishes statistics of prices, banking, and currency conditions 
in the principal nations of the world; foreign price statistics are 
published by the Bureau of Labor Statistics in its special bulletins; 
and statistics on trade between other countries, that is, the trade 
of the world outside the United States and not with the United 
States, are published by the Department of Commerce in Vol. 2 of 
the Commerce Yearbook (as well as the statistics of our own foreign 
trade). In 1938 the Paris International Chamber of Commerce 
published a brochure on the economic statistics in 26 countries. 

In addition to such collections of statistics for all or a majority 
of the countries of the world, mention should be made of the 
sources, in greater detail than the world volumes, for statistics 
concerning three of the important countries of Europe. For 
England and the Dominions, there is the Statistical Abstract for 
the British Empire, published by the Board of Trade. This 
combines what was previously published in the Statistical 
Abstract for the United Kingdom (first issued in 1864 for the years 
1840-1853) and the Statistical Abstract for the Several British Over- 
sea Dominions and Protectorates (first issued in 1864 for the years 
1850-1863). The French government publishes the Annuaire statistique 
(1878) and the Bulletin de la statistique générale (1911). In 
Germany the official source of statistics is the Statistisches Jahrbuch 
für das deutsche Reich (1880). 

It has long been recognized that international statistics would 
be extremely important in obtaining true international, political, 
and economic understanding and cooperation. Consequently, 
for many decades, efforts have been made to arrive at some sort 
of international understanding on methods to make the com- 
pilation of international statistics feasible or at least to improve 
existing world statistics. The statistics of each country are 
gathered according to the needs of that country; and since the 
problems in respective countries differ, so do the statistics. 
Their compilation and classification, according to varying 
definitions of units and varying bases of classification, produce 
startling differences in the final results. Then, too, the economic 
organizations of the various countries are different. A country 
with a large amount of transit trade and heavy reexportation 
of goods imported needs a different sort of classification of foreign 
trade statistics than a country doing little reexport business. 
Furthermore, the statistics themselves are gathered and organized 
in diverse ways in the various countries; the methods of collecting 
the statistical raw materials, the periods for which these data are 
gathered, and the methods of classification are not the same in 
the various countries. 

The endeavors made in the last eighty years for better inter- 
national statistical information, therefore, were first concentrated 
on the problem of rendering national statistics more comparable, 
since national statistics must be comparable between the various 
nations before they can be added up or compared to obtain 
international or world statistics. Quetelet, the Belgian who did 
so much to organize comparable international astronomical 
observations, was likewise the first to try to solve the problem 
of obtaining the fundamental basis for better world and international 
statistics. It is principally due to him that the First 
International Statistical Congress was organized in 1853 in 
Brussels. The main purpose of this Congress, the members of 
which attended in their private and not in their official capacity 
(although some were officials), was to bring about some degree 
of comparability in national statistics between the various 
nations. 

Another attempt to obtain international cooperation in 
statistical work was made in 1887 when the International Statis- 
tical Institute was formed. This organization, still in existence, 
elects members who are active in statistical work as professors, 
government officials, or members of private statistical offices. 
The Institute cannot bind its members or the national governments 
of its members but makes progress by suggesting improvements 
to different countries. 

The first official or semiofficial attempts for better world 
statistics were made in 1875 through the establishment of the 
International Bureau of the Universal Postal Union and the 
Bureau of the International Telecommunication Union (origi- 
nally called the International Bureau of the Telegraph Union). 
Both regularly gather statistics on postal and telegraphic developments. 
Similar efforts in another field were made for the first 
time in 1882, by the International Congress for Hygiene and 
Demography. In 1905, another significant official attempt was 
made for greater comparability in world statistics. In that year, 
at the suggestion of the United States government, a meeting 
was held in Rome to formulate some plan for obtaining uniform- 
ity of agricultural statistics. This meeting led to the founding 
of the International Agricultural Institute, which still is active 
in the gathering of world statistics on agriculture, production, 
consumption, prices, and trade. The statistical information 
assembled by this body is published monthly and yearly and 
special publications are also issued. Sixty-two different coun- 
tries are members of the Institute. The Institute was very 
successful in putting national agricultural statistics on an 
internationally more comparable basis and in assembling regu- 
larly good and reliable world statistics on all fields of agriculture. 

Since the First World War, the League of Nations has been 
the natural organization to proceed with the work of interna- 
tionalizing statistics. Shortly after its establishment, the League 
started that work. At the International Economic Conference 
of 1927 the problem of comparable national statistics in order to 
secure good world statistics was studied. The League of Nations 
subsequently brought about an official meeting on the subject 
of international statistics and called an International Statistical 
Conference to meet in Geneva in November, 1928. The keynote 
of the Conference was that the general adoption of comparable 
international statistics was desirable for good international 
policies and in the interests of permanent world peace. The aim 
of the Conference was to bring about the broadening of the scope 
of national statistics in all countries where it seemed to be needed 
and to attempt to make national statistics in different countries 
comparable. The Conference emphasized once more that such 
attempts meet with many difficulties. Of the 42 countries represented 
(some nonmembers of the League, like the United States, 
were also represented), only 29 countries felt they could sign the 
Convention and Protocol of the Conference. To induce that 
number to sign, it was necessary to limit greatly the program of 
work. 

Nevertheless, the Conference of 1928 did produce good results. 
A number of points were discussed, and important conclusions 
were reached. In addition, the Conference created a committee 
of technical experts to meet from time to time and make suggestions 
for further progress. This group met in March, 1931, and 
formulated a constitution for future work. It met again in 
December, 1933, to discuss problems of statistics on foreign 
trade. Up to the present time, its contribution to the solution 
of the problems involved has been inconsiderable, but it may 
make advances in this important work if the countries concerned 
will be willing to carry out the recommendations made by it, as 
they are apparently committed to do by the Convention and 
Protocol of the Conference of 1928. 

In 1936 the twenty-third session of the International Institute 
of Statistics was held at Athens. At that session there were 
75 members, of whom 10 were from North America. Twenty-seven 
countries designated official delegates. Also, the Secretary 
of the League of Nations, the International Labor Office, the 
International Institute of Intellectual Cooperation, the International 
Institute of Agriculture, and the International Chamber 
of Commerce were represented.¹ 

In May, 1940, one of the 11 sections of the Eighth American 
Scientific Congress convened by the government of the United 
States in connection with the observance of the fiftieth anniver- 
sary of the founding of the Pan American Union was devoted to 
statistics. The program of the section had the following broad 
objectives: (1) improvements in the comparability of official 

¹ Stuart, Prof. C. A. Verrijn, “XXIIIème session de l'Institut 
international de statistique, Athènes, 1936,” Revue de l'Institut international 
de statistique, vol. 4 (1936), pp. 367-403. The citation includes the summary 
of resolutions of the session (pp. 378-395) and communications from various 
delegations on methods, legislation, organization, and administration of 
statistics (pp. 396-403). 
statistics among the American nations; (2) improvements in 
statistical methodology; (3) the furtherance of acquaintance 
among the statisticians of the American continent; (4) con- 
sideration by these statisticians of the possible development of a 
continuing professional medium for the interchange of statistical 
ideas and information. Correspondents in several of the 
American nations had pointed to the need for closer profes- 
sional collaboration among the statisticians of this hemisphere, 
and it was proposed to explore at this meeting the possibilities 
of establishing some kind of an inter-American statistical organization 
of professional character. The result was the formation 
of the Inter-American Statistical Institute. 

A new quarterly, the Estadistica, published in Mexico, is the 
official organ of the Inter-American Statistical Institute, con- 
stituting one of its mediums for fostering statistical development 
in the Western Hemisphere. It endeavors to acquaint the 
persons in one country with statistical developments in other 
countries, to inform its readers concerning the availability of 
data, to present articles that will tend to encourage the adoption 
of improved methods, and hence to improve the quality of data. 
Articles may appear in any of the following four languages: 
Spanish, English, Portuguese, or French. An author’s summary 
accompanies each article; the summary is reproduced in 
several languages. The Inter-American Statistical Institute 
also publishes a yearbook of statistics including statistical data 
for Latin-American countries and North America. 

Prospects for securing comparable world and international 
statistics fluctuate with the rise and fall of isolationism 
and nationalism. Under the League of Nations and under the 
Pan American Union progress has been encouraged, only to be 
hampered by ever-persistent isolationism in one country or 
another. Nevertheless, the need for comparable data with 
respect to all nations of the world has become more and more 
evident, it has come to be more and more appreciated as the 
problems have been studied by these various institutes, con- 
ferences, and committees, and more and more is it coming to 
be realized that such statistics are a pressing necessity to busi- 
nessmen with interests spread far and wide over the inter- 
national field. 
While it has been stressed in this section that there are as yet 
no truly comparable international statistics, the student of 
international affairs and the international businessman will be 
able to obtain what constitutes for the present the closest 
approximation to them from a number of sources, chief among 
them the following: (1) International Statistical Yearbook (published 
by the League of Nations); (2) Vol. 2 of the Commerce 
Yearbook (published by the United States Department of 
Commerce); (3) The International Appendix to the Statistics 
Yearbook of Germany (Statistisches Jahrbuch für das deutsche 
Reich); (4) the Statesman's Yearbook. The World Peace Foundation 
also publishes a subject index to the economic and financial 
documents of the League of Nations. 



CHAPTER IV 


PRESENTATION OF STATISTICS 
TABLES 

Principles of Tabulation. Tabulation is the mechanical part 
of classification. Its function is so to arrange the physical presentation 
of quantitative facts that there can be no misinterpretation 
of their significance. The attainment of this object 
depends upon the following principles: 

1. Concise, clear, and complete titles attached to the table. 
Usually the title is placed at the top, above the table, but it is 
sometimes placed at the bottom. The function of the title is to 
give a general description of the contents of the table. 

2. Careful, unambiguous description of the units of measure- 
ment or presentation used in the collection and recording of the 
data. This is ordinarily placed immediately under the title. 
Subheadings frequently require definition of units. 

3. The arrangement of the data in columns and rows according 
to a clearly indicated basis for classification. 

4. The exact description of columns and rows by the use of 
caption headings and stub headings. 

5. Footnotes to clarify headings or subtitles or to specify 
limitations of particular figures. 

The scheme shown on page 93 gives an abstraction of the 
mechanics of tabulation. It shows the position of the title and 
the description of units above the table and for illustration 
designates four columns, numbered (1), (2), (3), and (4), and 
three rows, lettered (x), (y), and (z). 

The four columns are subcolumns — (1) and (2) are subcolumns 
of column (a), and (3) and (4) are subcolumns of column (b). 
The caption headings would appear in the spaces designated 
(a) and (b), respectively; and subcaption headings would appear 
in the spaces designated (1), (2), (3), and (4). Similarly, the 
three rows are described by stub headings appearing in (x), (y), 
and (z). The space (D) is for the general description of the stub 

headings. It is possible also to have stub subheadings. In 
order to illustrate further, there is reproduced in Table 1 on 
page 94 data compiled from the replies to the questionnaire 
shown on pages 46-47. 


                        Title
               (Description of units)

  (D)   |        (a)        |        (b)
        |   (1)   |   (2)   |   (3)   |   (4)
  (x)   |         |         |         |
  (y)   |         |         |         |
  (z)   |         |         |         |

General-purpose and Special-purpose Tables. A mere glance 
at the specimen taken from the publication of the United States 
Department of Agriculture is sufficient to lead to the conviction 
that such tables are not meant for light reading. They are 
essentially reference tables, or general-purpose tables. The principal 
guide in the construction of general-purpose tables is to 
include as much as possible in as small a space as possible, con- 
sistent with presentation of the amount of information deemed 
necessary. Thus the tables contained in such publications as 
the United States Census reports or the Federal Reserve Bulletin 
or the Survey of Current Business may not constitute popular 
reading; but they are a great boon to all who seek ready access 
to details, arranged in a manner so facilitating their discovery 
by the careful observer that looking up a particular figure is 
almost as easy as looking up a word in the dictionary. 

When a table is to be read — is to tell a story — it is called a 
special-purpose table. Such a table should have as its out- 
standing characteristic the quality of simplicity. It should not 
try to tell too much at once; if necessary, more than one table 
may be used for telling a more complex story. Special-purpose 
tables should have a great deal of white space in and around 
them to make lazy readers (and most people are lazy when it 
comes to reading tables of figures) think them easy to read. 
The type or print should be sufficiently large for easy reading. 
The reader should be adequately prepared or oriented to the 





Table 1. — Sources of Family Income: Number of Families Receiving Income from Specified Sources, Number 
Having Business Losses, Average Amount of Income Derived from Specified Sources, and Average 
Amount of Business Losses, by Income, Pacific Small Cities Combined, 1935-1936 
(White nonrelief families that include a husband and wife, both native-born) 
Pacific Region, Part I, Family Income, Miscellaneous Publication 339, p. 22. 
table by the text accompanying it and particularly by the title 
of the table. Briefly, the story of the table should be told in 
literary form in the text, reliance being placed on the table 

Table 2. — Average Disbursements of Consumer Units¹ in Each Third 
of Nation, 1935-1936 

                                    Average disbursements of families
                                    and single individuals in            Percentage of income
                                     Lower      Middle       Upper
                                    third,      third,      third,
                                   incomes  incomes of  incomes of
                                     under       $780-      $1,450     Lower  Middle   Upper
Category of disbursement              $780      $1,450    and over     third   third   third

Current consumption:
  Food                                $236        $404        $642      50.2    37.5    21.7
  Housing                              115         199         408      24.4    18.5    13.8
  Household operation                   54         108         240      11.4    10.0     8.1
  Clothing                              47         102         251      10.0     9.5     8.5
  Automobile                            16          57         215       3.3     5.3     7.2
  Medical care                          20          41         106       4.3     3.9     3.6
  Recreation                             9          28          89       1.8     2.6     3.0
  Furnishings                            9          28          72       1.8     2.6     2.4
  Personal care                         12          22          44       2.5     2.1     1.5
  Tobacco                               10          23          40       2.2     2.1     1.4
  Transportation other than auto        11          19          37       2.4     1.7     1.3
  Reading                                6          12          23       1.3     1.2     0.8
  Education                              2           7          30       0.5     0.6     1.0
  Other items                            3           6          15       0.6     0.5     0.5
    All consumption items             $550      $1,056      $2,212     116.7    98.1    74.8
Gifts and personal taxes²              $13         $39        $181       2.8     3.7     6.1
Savings                                -92         -19         566     -19.5    -1.8    19.1
    All items                         $471      $1,076      $2,959     100.0   100.0   100.0

1 Includes all families and single individuals, but excludes residents in institutional groups. 

2 Taxes shown here include only personal income taxes, poll taxes, and certain personal 
property taxes. 

Source: National Resources Committee, Consumer Expenditures in the United States, 
Estimates for 1935-36 (1939), p. 40. 

merely as a dramatic summary. Simple devices to aid inter- 
pretation and facilitate the mental vision of the table have a 
useful place in special-purpose tables, such as accompanying 
relative figures, methods of emphasis such as italics, or the 
scheme of ruling the table. 
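
The relative figures that accompany the dollar amounts in Table 2 can be computed as in the following minimal sketch. The function name `percent_of_income` and the variable names are hypothetical. The inputs are the rounded lower-third averages printed in Table 2, so a result may differ by a tenth from the published percentage, which was presumably computed from unrounded amounts.

```python
def percent_of_income(disbursements, income):
    """Express each disbursement as a percentage of income, to one decimal."""
    return {
        item: round(100.0 * amount / income, 1)
        for item, amount in disbursements.items()
    }


# Lower-third averages from Table 2, in dollars per consumer unit.
lower_third = {
    "All consumption items": 550,
    "Gifts and personal taxes": 13,
    "Savings": -92,
}
income = sum(lower_third.values())  # the "All items" figure, 471
relative = percent_of_income(lower_third, income)
for item, pct in relative.items():
    print(f"{item:<26} {pct:>7.1f}")
```

The negative savings figure is what carries consumption above 100 per cent of income for the lower third, a point that the relative column makes visible at once.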
The object of a special-purpose table may also be to compress 
into a small space a body of information “the narration of which 
in the text would be cumbersome and exhausting to the reader. 


Table 3. — Share of Each Third of Nation’s Consumer Units¹ in 
Aggregate Disbursements, 1935-1936 

                                                                      Percentage of aggregate
                                                                      disbursement for each
                                    Aggregate disbursements, millions category made by
                                       Lower    Middle     Upper
                                      third,    third,    third,
                                     incomes   incomes   incomes
                                       under  of $780- of $1,450     Lower  Middle   Upper
Category of disbursement                $780    $1,450  and over     third   third   third

Current consumption:
  Food                                $3,108    $5,310    $8,447      18.4    31.5    50.1
  Housing                              1,515     2,621     5,370      15.9    27.6    56.5
  Household operation                    703     1,422     3,160      13.3    26.9    59.8
  Clothing                               618     1,338     3,305      11.7    25.5    62.8
  Automobile                             203       755     2,823       5.4    20.0    74.6
  Medical care                           264       546     1,395      12.0    24.7    63.3
  Recreation                             115       362     1,166       7.0    22.0    71.0
  Furnishings                            112       368       942       7.9    25.9    66.2
  Personal care                          155       292       585      15.1    28.2    56.7
  Tobacco                                134       301       531      13.8    31.2    55.0
  Transportation other than auto         150       247       487      17.0    27.9    55.1
  Reading                                 84       165       302      15.3    29.9    54.8
  Education                               30        87       389       5.9    17.2    76.9
  Other items                             35        76       196      11.4    24.6    64.0
    All consumption items             $7,226   $13,890   $29,098      14.4    27.7    57.9
Gifts and personal taxes²               $171      $516    $2,380       5.6    16.8    77.6
Savings                               -1,207      -252     7,437     -20.2    -4.2   124.4
    All items                         $6,190   $14,154   $38,915      10.4    23.9    65.7

1 Includes all families and single individuals, but excludes residents in institutional groups. 
2 Taxes shown here include only personal income taxes, poll taxes, and certain personal 
property taxes. 

Source: National Resources Committee, Consumer Expenditures in the United States, 
Estimates for 1935-36 (1939), p. 61. 

It is, in short, a method of condensation, and it is of the utmost 
importance that, as it tells so much in so small a compass, it 
tell it as clearly as practicable.”¹ 

¹ Falkner, Roland P., “Statistical Tabulation and Practice,” Journal 
of the American Statistical Association, vol. 11 (1916), pp. 192-200. 
Tables 2 to 6 are examples of special-purpose tables. They 
tell stories that are more or less hidden in the detailed but well- 


Table 4. — Percentage Distributions of Nonrelief Families¹ in Six 
Types of Community, by Income Level, 1935-1936 

                              Families living in
                              Urban communities (by population)                             Rural communities
                         All  Metropolises,²  Large cities,     Middle-sized  Small cities,  Nonfarm³    Farm
                    families       1,500,000       100,000-  cities, 25,000-   2,500-25,000
Income level                        and over      1,500,000          100,000

Under $250               2.8             1.7            2.0              2.4            3.1       3.0     3.8
$250-$500                7.8             2.8            4.4              5.5            6.3       8.9    13.9
$500-$750               11.3             5.2            7.6              9.4           10.3      11.8    18.0
$750-$1,000             13.4             8.5           10.5             13.6           13.9      14.4    16.6
$1,000-$1,250           13.2            10.9           12.4             13.9           14.6      14.0    12.8
$1,250-$1,500           10.8            11.0           10.6             11.6           11.1      11.6     9.8
$1,500-$1,750            9.1            10.8           10.0              9.7            9.4       9.1     7.0
$1,750-$2,000            7.3             9.7            9.0              8.5            7.8       6.5     4.8
$2,000-$2,250            5.5             7.9            6.9              6.1            5.8       5.1     3.1
$2,250-$2,500            4.0             5.8            5.5              4.5            4.0       3.4     2.5
$2,500-$3,000            5.2             8.5            7.1              5.4            5.3       4.4     2.9
$3,000-$3,500            3.0             4.7            4.2              3.1            3.1       2.3     1.6
$3,500-$4,000            1.8             2.9            2.7              1.7            1.7       1.3     1.0
$4,000-$4,500            1.0             1.7            1.6              1.0            0.8       0.8     0.5
$4,500-$5,000            0.6             0.9            0.9              0.7            0.5       0.6     0.3
$5,000-$7,500            1.3             2.1            1.8              1.3            1.1       1.4     0.6
$7,500-$10,000           0.8             1.6            1.1              0.6            0.6       0.6     0.4
$10,000 and over         1.1             3.3            1.7              1.0            0.6       0.8     0.4
  All levels           100.0           100.0          100.0            100.0          100.0     100.0   100.0

1 Excludes all families receiving any direct or work relief (however little) at any time 
during year. 

2 Metropolises of this size are in North Central Region only (New York, Chicago, Philadelphia, 
and Detroit). 

3 Includes families living in communities with population under 2,500, and families living 
in the open country but not on farms. 

Source: National Resources Committee, Consumer Incomes in the United States, Their 
Distribution in 1935-36 (1938), pp. 24-25. 

organized statistics collected by means of the questionnaire 
referred to above. In order to simplify the data for presentation, 
1 Taxes shown here include only personal income taxes, poll taxes, and certain personal property taxes. 

Source: National Resources Committee, Consumer Expenditures in the United States, Estimates for 1935-36 (1939), p. 20. 
Source: National Resources Committee, Consumer Expenditures in the United States, Estimates for 1935-36 (1939), p. 23. 
income levels are divided into three groups, lower third, middle 
third, and upper third. These tables illustrate also the use of 
percentage figures to facilitate their interpretation. 
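
The percentage figures of Table 3 follow directly from the aggregate amounts in each row. A minimal sketch is given below; the function name `shares` is hypothetical, and the inputs are the aggregate disbursements, in millions of dollars, printed in Table 3 for food and for savings.

```python
def shares(row):
    """Return each group's percentage share of the row total, to one decimal."""
    total = sum(row)
    return [round(100.0 * value / total, 1) for value in row]


# Aggregate disbursements in millions: lower, middle, and upper third.
food = [3108, 5310, 8447]
savings = [-1207, -252, 7437]

print(shares(food))     # [18.4, 31.5, 50.1]
print(shares(savings))  # [-20.2, -4.2, 124.4]
```

The savings row shows why a share may exceed 100 per cent: the lower and middle thirds spent more than they received, so the upper third accounts for more than the whole of net savings.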

CHARTS 

Quick visualization of many rather complex situations can 
be readily achieved by merely looking at a simple chart. It is 
said that nowadays the first step toward using a series of data 
for any sort of analysis is to represent the figures by a line drawn 
on a chart. So useful is the chart in giving a quick grasp of the 



Fig. 11. — Federal expenditures for war activities, 1940-1943. (From data published in Daily 
Statement of the United States Treasury Department.) 

characteristics of data that it has been adopted in many popular 
books, in magazines, and in the financial section of metropolitan 
newspapers. Figures 11 and 12 illustrate dramatically the 
manner in which charts are used to aid in visualizing important 
developments during wartime. In peacetime the trends of 
data, even though less sensational, are watched with care, and 
charts greatly facilitate their analysis. 

The invention in 1786 of charting is claimed by William Playfair, 
who set forth its advantages as follows:¹ “As the eye is the 

¹ The Commercial and Political Atlas (3d ed., London, 1801), p. x. Playfair's 
claim to be “actually the first who applied the principles of geometry 
to matters of Finance” is made on pages viii and ix. Cited from W. C. 
Mitchell, Business Cycles: The Problem and Its Setting, p. 209. In 
Enquiry into the Decline and Fall of Nations Playfair is said to have been the 
first to employ graphical devices in the treatment of sociological discussion. 
best judge of proportion, being able to estimate it with more 
quickness and accuracy than any other of our organs, it follows, 
that wherever relative quantities are in question, a gradual 
increase or decrease of any . . . value is to be stated, this mode 
of representing it is peculiarly applicable; it gives a simple, 
accurate, and permanent idea, by giving form and shape to a 
number of separate ideas, which are otherwise abstract and 
unconnected.” 



Fig. 12. — Production of munitions, including ships, planes, tanks, guns, ammunition, 
and all field equipment, monthly, 1940-1943. (Data from War Production Board.) 


While the idea underlying the use of charts is quite old, the 
general use of charts for wide public consumption is of much 
more recent origin and probably owes its present-day popularity 
to inventions having to do with the plating of charts for printing. 
From being largely a hand-labor process, the making of plates 
for the reproduction of charts has come in recent years to be a 
photoelectric process, with the result that today the most 
expensive part of the charts in a book, newspaper, or magazine 
article consists in the mental and hand labor involved in the 
original construction of the chart. 

There are five kinds of charts: (1) pictograms, (2) cartograms, 
(3) frequency curves, (4) bivariate charts, and (5) curves picturing 
time series. 


“William Playfair was, one may say, the Sir William Petty of the Edinburgh 
group . . . .” Lancelot T. Hogben, Dangerous Thoughts (1939), p. 283. 
Pictograms. There are four kinds of pictograms: (1) linear pictograms, 
in which the comparison is a linear one; (2) areal 
pictograms, in which the comparison is one of areas; (3) cubic 
pictograms, in which the comparison is one of cubes, or three- 
dimensional objects; and (4) sectors and circles, in which a circle 
is used to represent a whole and its various sectors are parts 
of the whole. 



Fig. 13. — Distribution of the milk dollar. (Data from The Milk Dollar, Milk 
Industry Foundation.) 

The purpose of pictograms is to aid in rapid visualizing of 
coordinate comparisons of magnitudes. For example, a picto- 
gram might represent by a picture of a man the population of the 
United States, accompanied by a picture of proportionately 
smaller men representing, respectively, the populations of 
France and Germany. Sometimes pictograms are used to aid 
in visualizing the proportional parts of a whole magnitude, or 
comparison of component parts, as where a dollar is shown divided 
into sectors, representing the way in which the “public dollar” 
is spent. Figure 13 is an illustration of a pictogram showing the 
“milk dollar.” The small obscured piece at the top represents 
2.98 cents of profit for the New York City distributors. 

Areal and cubic comparisons are not frequently used because, 
instead of simplifying the comparison desired, they are likely 



to confuse it. This is because the mind finds difficulty in 
quickly differentiating sizes of areas or of cubes. Figure 14 
shows two areas in the form of squares. One of these areas is 
actually one-half as large as the other; but, at first glance, 
it seems to be more than half as large. Consequently, if com- 



parison of two quantities is desired by charting, areal presenta- 
tion is not a desirable method of obtaining easy comprehension 
of the differences that it is desired to stress. 

The difficulty is increased if the attempt is made to chart 
differences of magnitude by the use of cubes, for it is still more 
difficult for the eye and mind to grasp geometric comparative 
magnitudes in three dimensions. This is shown in Fig. 15, 
which depicts two cubes, one of which is one-half as large as the 
other, though a first glance makes it appear to be two-thirds 
as large. For this reason the use of pictures for making comparisons 
is not considered to be the best practice. For example, 
the presentation for quick visualization of different-sized men 
in uniform to represent the relative fighting strength of various 
countries or of different-sized battleships to represent the relative 
size of navies will confuse the interpretation that the eye and 
mind will give to the relative sizes compared, even though the 
relative size is given purely a linear setting in the actual drawing 
of the figures.¹ Only the height of the uniformed men may be 
varied, but this might lead to comically proportioned men and 
an illusion of armies of tall thin men vs. armies of short fat men. 
If the uniformed men are properly proportioned for their varying 
heights, this results in an areal comparison. 
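
The arithmetic behind the difficulty with areas and cubes can be shown in a short sketch. The function name `linear_scale` is hypothetical. It gives the ratio of the sides (or edges) of two similar figures whose areas (or volumes) stand in a stated ratio, on the assumption that the eye judges largely by linear extent.

```python
def linear_scale(ratio, dimensions):
    """Ratio of linear sizes for similar figures whose areas (dimensions=2)
    or volumes (dimensions=3) stand in the given ratio."""
    return ratio ** (1.0 / dimensions)


side = linear_scale(0.5, 2)  # square with half the area
edge = linear_scale(0.5, 3)  # cube with half the volume
print(f"side ratio: {side:.3f}")  # about 0.707
print(f"edge ratio: {edge:.3f}")  # about 0.794
```

A square of half the area thus has sides about seven-tenths as long, and a cube of half the volume has edges about four-fifths as long, so a reader who compares lengths will take the smaller figure for more than half of the larger.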

Consequently, the most generally used types of pictogram 
are those involving merely linear comparisons and the use of 
purely abstract linear distances. Rows of soldiers, each soldier 
representing a specified number of men, may be used to advan- 
tage, however, the longer row representing the larger army. 
Similarly, large and small navies can properly be compared by 
rows of ships, each ship representing a specified tonnage of that 
type of warship. Such pictograms are really linear comparisons 
as also are bar charts and sectors of circles. 
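
A row pictogram of this kind may be sketched as follows. The function name `pictogram_row`, the army sizes, and the choice of 100,000 men per symbol are all hypothetical figures used only for illustration. Each symbol stands for the same specified number, so that the comparison rests on the length of the row alone.

```python
def pictogram_row(label, quantity, unit, symbol="*"):
    """One row of a linear pictogram: one symbol per `unit` of quantity."""
    count = round(quantity / unit)
    return f"{label:<10} {symbol * count}"


# Hypothetical army sizes; each symbol represents 100,000 men.
unit = 100_000
rows = [
    pictogram_row("Army A", 500_000, unit),
    pictogram_row("Army B", 200_000, unit),
]
print("\n".join(rows))
print(f"(each * = {unit:,} men)")
```

Because every symbol is the same size, the longer row represents the larger army in exact proportion, and no areal illusion arises.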

Bar Charts and Sectors of Circles. The use of bar charts 
and sectors of circles is widely practiced and finds its application 
whenever it is desired to compare two or more differing mag- 
nitudes with each other or to give quick visualization of com- 
ponent parts of a given magnitude. Extensive use of vertical 
or horizontal bars is made by the United States Bureau of the 
Census in the Statistical Atlas of the United States, one of which 
was issued in 1914 and another in 1924. In addition, many 
modern writings, especially in the fields of the social sciences, 
attempt to portray by charts the statistics it is desired to present 
for popular reading. 

¹ Cf. Croxton, F. E., and Harold Stein, “Graphic Comparisons by Bars, 
Squares, Circles and Cubes,” Journal of the American Statistical Association, 
vol. 27 (1932), pp. 54-60. 
Figure 16 is a graphic portrayal of the budget expenditures 
of the Federal government, based upon legislation in effect in 
February, 1943, in which the blacked-out portion of the vertical 
bars reveals in a striking manner the expected increases from 
year to year in expenditures for war activities. 
Fig. 16. — Budget expenditures of the Federal government, based upon legislation 
as of February, 1943. (The Budget of the United States Government.) 
Vertical bars for fiscal years 1942, 1943, and 1944; components: war activities, interest on public debt, 
other activities, government corporations and agencies (net expenditures), and trust accounts. 
1 Transactions in checking accounts. 
2 Includes statutory public debt retirement. 


The use of horizontal bars is illustrated in Fig. 17, which 
shows graphically the statistical data in Table 4. The difference 
between the distribution of income among nonrelief families in 
metropolitan areas and that among families on 
farms is seen at a glance, and a slight scrutiny of the bars brings 
out the less dramatic but clear differences in the distribution of 
income in small cities compared with that in the larger ones. 

Another government publication contains data, shown in 
Table 6, from which charts were drawn that illustrate the use 
of the component-part bar chart. The data that appear in 
these tables are shown in component bars in Fig. 18, where 
the length of the bar is varied in accordance with the income 


[Figure: horizontal bars showing the per cent of families at each income 
level, with a separate panel for each type of community.] 

Fig. 17. — Income distributions of nonrelief families in six types of community, 
1935-1936. (Based on Table 4.) 

level. This makes possible the visual comparison of the average 
total family expenditure at various income levels. For example 
at the income level of $2,000 to $2,500 the aggregate family 
expenditure averages a little over $2,000. At the same time, 
the amount spent for various purposes can be seen from the 
differently crosshatched parts of each bar. Throughout the 
bars one kind of crosshatching represents a specified kind of 

[Figure: horizontal component bars, one for each income level from under $500 
to $15,000-20,000 and one for the average of all levels; horizontal scale in 
thousands of dollars. Crosshatching distinguishes the kinds of expenditure, 
such as food and housing. Note: Taxes shown here include only personal income 
taxes, poll taxes, and certain personal property taxes.] 

Fig. 18. — Use of income by American families at different income levels, 
illustrating the use of bar diagrams. (Based on Table 6.) 

[Figure: horizontal 100 per cent component bars, one for each income level 
from under $500 to $15,000-20,000 and one for the average of all levels; 
horizontal scale in per cent of income, 0 to 160, with negative savings 
marked. Crosshatching distinguishes the kinds of expenditure, such as food, 
clothing, and automobile. Note: Taxes shown here include only personal income 
taxes, poll taxes, and certain personal property taxes.] 

Fig. 19. — Percentage use of income by American families at different income 
levels, 1935-1936, illustrating the use of 100 per cent bar diagrams. (Based 
on Table 6.) 

expenditure. The second desirable comparison is still more 
quickly grasped by the use of 100 per cent component part bars, 
which is illustrated in Fig. 19. When such a chart is drawn, 
it is always advisable to warn readers that 100 per cent bar 
charts are being used; in addition, the table of actual figures 
should be given, for the actual figures are completely concealed 
in the relative figures if only the chart is given. It will be 
noticed that clever arrangement of crosshatching, placing con- 
trasting types adjacent to each other, aids greatly in the reading 
of the chart. Figures 20 and 21 are interesting uses of the bar 



Fig. 20. — Variation in expenditures with income, illustrating the use of a cross- 
hatched zone diagram. [National Resources Committee, Consumer Expenditures 
in the United States, Estimates 1935-1936 (1939), pp. 105-160.] 

chart, virtually in the form of zones, to show the distribution of 
the consumer food dollar on the assumption of four different 
total national income levels. The same data are shown in 
Fig. 21 in the form of a 100 per cent bar or zone chart. The use 
of the zone effect has the advantage of aiding the eye to make 
the principal indicated comparisons. 
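
The conversion from a component bar chart to a 100 per cent bar chart can be sketched as below. This is only an illustration: the dollar figures are hypothetical and are not taken from Table 6, and the function name `to_percent_components` is invented for the example. It shows why the table of actual figures should accompany the chart, since the totals disappear once every bar is scaled to 100.

```python
# Hypothetical spending figures in dollars for two income levels (not from Table 6).
spending = {
    "$1,000-1,500": {"food": 450, "housing": 350, "other": 400},
    "$2,000-2,500": {"food": 650, "housing": 600, "other": 850},
}


def to_percent_components(parts):
    """Convert absolute component amounts to percentages of their total."""
    total = sum(parts.values())
    return {name: 100.0 * amount / total for name, amount in parts.items()}


# Every bar now has the same length (100 per cent); the totals are no longer visible.
percent_bars = {level: to_percent_components(parts) for level, parts in spending.items()}

for level, shares in percent_bars.items():
    actual_total = sum(spending[level].values())
    rounded = {name: round(share, 1) for name, share in shares.items()}
    print(level, "| actual total:", actual_total, "| per cent:", rounded)
```

The printed actual totals play the part of the accompanying table that the text recommends.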

There are many examples of the use of sectors of circles in 
the Statistical Atlas of the United States, census of 1920, and a 
number in the publications of the census of 1930. Figure 22 is an 
example of a single circle divided into sectors representing 
component parts in the utilization of milk in the United States 
in 1929. As in the case of the component bar charts, so also 
in the case of sectors of a circle, it is possible to represent changes 



Fig. 21. — Variation in percentages of various expenditures with income, 
illustrating the use of a 100 per cent crosshatched zone diagram. [National 
Resources Committee, Consumer Expenditures in the United States, Estimates 
1935-1936 (1939), pp. 105-166.] 

from time to time in percentage components by the use of a 
series of circles. It is not advisable to use the sectors and 
circles as bars were used in Fig. 21, namely, to picture relative 
change and total change simultaneously. To do this with 
sectors and circles involves areal comparisons that are not 
grasped by the readers of the charts. In Fig. 23, which is 
presented to illustrate the use of sectors and circles, the attempt 
has been made to show also such an areal comparison. While 
most people would see at a glance that the circle for the end of 
1938 is smaller than the circle for the end of 1930, presumably 
to indicate that the total United States long-term investments 
in foreign countries were smaller in 1938 than in 1930, few could 
see from the areal comparison of the circles how much smaller. 



Fig. 22. — Utilization of milk in the United States, 1929, illustrating the use 
of sectors of circles. Based on value. (Fifteenth Census of the United States, 
1930, Vol. 4, Agriculture.) 


Perhaps it is sufficient to have the smaller 1938 circle call atten- 
tion to the fact and then assume that the reader will be led 
thereby to note the figures, which are shown in a separate table. 
But the figure shown in each sector of the circles is a component 
percentage and does not throw light on aggregate amount. 
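
The difficulty of areal comparison can be made concrete with a short calculation. The sketch below assumes that the circles are drawn with areas proportional to the totals they represent (the usual practice, though the text does not say how Fig. 23 was constructed), and the two totals are hypothetical. Because area varies with the square of the radius, a large fall in the total produces a much smaller change in the width of the circle.

```python
import math


def radius_for_total(total, reference_total, reference_radius=1.0):
    """Radius that makes the circle's area proportional to the total it represents."""
    return reference_radius * math.sqrt(total / reference_total)


# Hypothetical totals; the second is 36 per cent smaller than the first.
total_1930 = 100.0
total_1938 = 64.0

radius_1938 = radius_for_total(total_1938, total_1930)

print("decline in total:   ", round(100 * (1 - total_1938 / total_1930)), "per cent")
print("decline in diameter:", round(100 * (1 - radius_1938)), "per cent")
```

Under these assumed figures the eye sees a circle only one-fifth narrower, although the total has fallen by more than one-third.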

For the purpose of showing graphically the component parts 
of a total, the split-bar chart is a promising new device. Figure 
24 illustrates its use to show the distribution of the consumer 
food dollar. Comparison between consumer dollars of varying 
[Figure: two circles of different sizes, labeled END OF 1930 and END OF 1938, 
each divided into sectors showing per cent of total for Canada and 
Newfoundland, Europe, Latin America, and the rest of the world.] 

Fig. 23. — The United States' long-term investments in foreign countries, end 
of 1930 and end of 1938, illustrating use of circles of different sizes. (Bureau of 
Foreign and Domestic Commerce, “The Balance of International Payments of the 
United States in 1938,” p. 49.) 

Fig. 24. — Distribution of consumer food dollar, 1935, illustrating use of a 
split-bar chart. [National Resources Committee, The Structure of the American 
Economy, Part I (1939), p. 68.] 
purchasing power could be shown by differences in the over-all 
lengths of the bars used. 

Simple bar charts and sectors of circles, it will be noted, do 
not involve areal comparisons to depict component parts; the 
bars and sectors are areas, it is true, but it is not the areas that 
are compared. The comparisons are between the varying lengths 
of the bars, because the bars are of uniform width. Bars of 
varying widths would complicate the comparisons and make 
them areal. Moreover, sectors of circles do not involve areal 
comparisons, because the comparisons visualized are the arcs 
cut on the circle by the angles from the center of the circle. 
The visual comparison is therefore a linear one. Everyone is 
used to estimating the size of a piece of pie. 
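
Since the comparison among sectors rests on the angles at the center, laying out a sector chart amounts to turning each component percentage into a central angle. A minimal sketch follows; the component names and percentages are hypothetical, and `sector_angles` is a name invented for the example.

```python
def sector_angles(percentages):
    """Map each component's share of the whole to a central angle in degrees."""
    total = sum(percentages.values())
    return {name: 360.0 * share / total for name, share in percentages.items()}


# Hypothetical component percentages of a single total.
shares = {"component A": 50, "component B": 30, "component C": 20}
angles = sector_angles(shares)

for name, angle in angles.items():
    print(f"{name}: {angle:.0f} degrees")
```

The arc cut by each sector is proportional to its angle, so the sectors compare in the same ratios as the percentages themselves.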

Cartograms. As the name indicates, cartograms are maps. 
Generally, outline maps are used and various devices employed to 
picture varying characteristics of different parts of the country. 
All are familiar with the colored maps that show the mountainous 
sections in brown shaded off to the green of the lowlands, the 
light brown in between being the higher plateaus, but not moun- 
tains. The same principle is used in a variety of ways to present 
statistics regarding geographically classified characteristics 
of the country by the use of maps. These will be classified and 
briefly described and illustrated. 

Cartograms by Dots, or Points. Dots varying in size for 
different quantities are used in the first class of cartogram of 
this type. Because of the necessity of making areal comparisons, 
that is, of using different-sized circles, this type of cartogram is 
not widely employed and is considered not a good method of 
presenting subjects in cartogram form. An example in Fig. 25, 
however, shows clearly the areas of geographical concentration in 
1935 of wage earners in manufacturing industries of the United 
States. An attempt is made to facilitate the areal comparisons 
involved in this cartogram by supplying a key, in the lower right 
corner of the map. 

In the second kind of cartogram of this type dots of uniform 
size are used, each dot indicating an aggregate specified. When 
dots of this sort are used, they can be counted to figure out the 
total. Sometimes a dot is quarter or half shaded to indicate a 
quarter or half the amount of a full dot. When thus used, the 
Fig. 26. — Wholesale trade in the United States, 1930, illustrating use of equal-sized dots. 

Bulletin, February, 1940, p. 94.) 

Fig. 28. — Illustrating the use of density crosshatching. [Source: Bureau of Agricultural Economics, 
Financial Review, Vol. 5 (1942), p. 62.] 

dots should be of sufficient relative size so that there will not be 
too many of them. An example is shown in Fig. 26. 

The chief difficulty in the use of this kind of cartogram is the 
mechanical one of arriving at the proper magnitude to assign 
to each dot of uniform size. If the magnitude assigned to each 
dot is too large, it becomes difficult to show graphically the 
small quantities relating to geographical locations where the 
characteristic is scarce. On the other hand, if the magnitude 
assigned to each dot is too small, this results in too great a crowding 
of the dots in areas where the characteristic is very plentiful. 
In Fig. 26, this is illustrated by the attempt to picture the 
volume of wholesale trade of the state of New York, compared 
with the rest of the country. The dots are so dense that it is 
hardly possible to count their number. While the general 
picture of relative density is quickly visualized from such a map, 
this purpose can be better served by the use of the point-dot 
map. Another objection to the dot of uniform size map for 
this particular purpose is that it may convey the impression 
that the concentration of wholesale trade is over the whole 
state of New York, whereas it is known to be concentrated in the 
metropolitan area of New York City. This conception is 
more clearly brought out by the use of the device of using large 
dots of varying size. 

[Figure: outline maps of three states, one labeled CALIFORNIA.] 

Fig. 32. — Distribution of rubber manufacturing in three leading states in 1937, 
illustrating a dramatic use of a point-dot map. [Reproduced from Barker, P. W., 
and E. G. Holt, Rubber Industry of the United States, 1938-1939 (Bureau of Foreign 
and Domestic Commerce), p. 20.] 
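
The dilemma in choosing the magnitude per dot can be seen by counting the dots that a given choice requires. The sketch below uses hypothetical trade volumes for three states (they are not the figures behind Fig. 26), and `dots_needed` is an invented helper name.

```python
# Hypothetical wholesale trade volumes, in millions of dollars.
volumes = {"New York": 12000, "Ohio": 3000, "Nevada": 40}


def dots_needed(values, per_dot):
    """Number of uniform dots for each place when one dot stands for per_dot units."""
    return {place: round(value / per_dot) for place, value in values.items()}


coarse = dots_needed(volumes, 500)  # large magnitude per dot
fine = dots_needed(volumes, 10)     # small magnitude per dot

print("500 per dot:", coarse)  # the smallest state receives no dot at all
print(" 10 per dot:", fine)    # the largest state needs over a thousand dots
```

With the coarse choice the scarce area vanishes from the map; with the fine choice the plentiful area becomes too crowded to count, which is the situation the text describes for New York.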

The third type is the point-dot cartogram, in which each dot 
means a certain quantity, but the dots are so small that they 
cannot be conveniently counted. The significance lies in pre- 
senting the idea of relative density of dots. Figure 27 shows the 
concentration in the Southeast and the Northern Middle states 
of nonpar banks of the United States. 

Cartograms by Colors and Shades. Obviously, the same effect 
can be produced by the use of colors and shades as by the use of 
dots, but the former are expensive to reproduce in print and 
therefore are not extensively employed. The Statistical Atlas 
of the eleventh and twelfth censuses of the United States 
contains numerous such cartograms. 

Cartograms by Crosshatching. Making comparisons relating to 
geographical location by crosshatching maps has increased in 
popularity during recent years. It is more effective than the 
method of dots and is cheaper than coloring and shading. 
Figure 28 makes it easy for the reader to visualize the variation 
in different parts of the United States in the proportion of 
mortgaged owner-operated farms paying rates of interest as 
high as or higher than 6.5 per cent. Figure 29 shows at a glance 
the variation from state to state in the percentage increase in 
nonagricultural employment from 1940 to 1943. 

Figure 30 is an interesting experiment in the combined use 
of a map and bar chart to show variation in the percentage 
increase in manufacturing employment in various metropolitan 
areas from 1940 to 1943. Figure 31 shows the use of a map and 
bars to depict flow of freight traffic in the United States. In 
Fig. 32 the geographical concentration of the rubber-manu- 
facturing industry in three states of the United States is 
dramatically emphasized by showing outline maps of only those 
three states. 
STATISTICS — A STUDY OF VARIATION 

Ubiquitous Variability. Only in the abstract sense is there 
such a thing as a fixed quantity; in all cases, with reference both 
to physical and to psychic things, practical quantitative expres- 
sions are variables. However fixed the true quantity may be, 
no human measuring device is capable of giving the exact 
quantity; hence, all measurements obtained are approximations. 
In both physical sciences and social sciences, the raw materials 
amenable to the techniques of statistics are quantitatively 
expressed variations. The methods of analysis are likely to be 
complex when the scientist is faced with complex variability. 
This fact for the social sciences is recognized in the following 
quotation:¹ “The social scientist is limited by the fact that he 
does not deal with rational material but with the rational and 
irrational conduct of man. The host of variables which this 
fact introduces multiplies the obstacles to his work and sets 
limits to the applicability of results.” 

USE OF SYMBOLS 

Simplification of the complex methods that need to be used in 
statistics is accomplished by the use of symbols. Because sym- 
bols are used for various purposes, beginners may have a natural 
psychological reaction unfavorable to the study of statistics. 
The uninitiated may be mystified and frightened away from the 
subject on account of the symbolic presentation. It is impor- 
tant, therefore, to realize that the symbols used in statistics are 
quite simple and that there are not very many of them. Fur- 
thermore, they are easily learned and remembered, as soon as 
their real purpose of simplification is understood. 

1 Fosdick, Raymond B., A Review for 1939 — The Rockefeller Foundation, 
pp. 41-42. This foundation contributes extensively to the support of 
research in many scientific fields; for example, it contributes to such research 
organizations as the Brookings Institution and the National Bureau of 
Economic Research, discussed in Chap. III. 
The Variable X. The study of variation is the meat and 
bones of the craft. The variable X is not a new idea to anyone 
who has gone as far as a first course in algebra and who has on 
many occasions said, “Let X equal . . . .” Symbols enter into 
statistical analysis in only three ways: 

1. To represent variation in size with time; in such a case the 
data measuring the variable are designated “time series.” 

2. To represent variation in order of magnitude, from smallest 
to largest, or vice versa (if time is involved, it is disregarded, as 
the variable is rearranged or reclassified upon the basis of magnitude); 
in such a case the data measuring the variable are 
designated “frequency series.” 

3. To represent variation in quality or attribute (for example, 
occupation, geographical location, or race). 

In symbolic language, it is purely a matter of convention that 
the variable may be referred to as X or as Y or as Z. In a given 
problem, if the nomenclature of X is assigned to a given variable, 
it is necessary to retain that symbol for that particular variable 
throughout the problem. In the theory of statistics conventions 
have arisen as to the use of symbols; for example, variables are 
commonly designated by the letters at the end of the alphabet, 
while constants or known figures are designated by the letters 
at the beginning of the alphabet. 

One convention widely followed is to use a bar over a letter 
to designate the arithmetic mean, so that $\bar{X}_i$ (read “$X_i$ bar”) is 
the symbol for the mean of a series of $X_i$'s. Another group of 
X's would be $X_j$ and their mean $\bar{X}_j$. The subscripts i and j, 
respectively, symbolize subgroups. For example, all the X's 
may refer to the I.Q.'s of college freshmen; $X_i$ refers to the I.Q.'s 
of male freshmen; and $X_j$ refers to the I.Q.'s of female freshmen. 
Accordingly, $\bar{X}_i$ symbolizes the mean I.Q. of male freshmen, and 
$\bar{X}_j$ symbolizes the mean I.Q. of female freshmen. It is then 
conventional to designate the mean of all the X's, both $X_i$ and 
$X_j$, as $\bar{X}$ (called “X bar”). 
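
The relation between the subgroup means and the mean of all the X's can be checked numerically. The scores below are hypothetical, and the variable names are chosen only to mirror the symbols in the text. The sketch also brings out a point the notation leaves implicit: the over-all mean is the average of the subgroup means weighted by the number of items in each subgroup, not their simple average.

```python
from statistics import mean

# Hypothetical I.Q. scores for the two subgroups.
x_i = [104, 110, 98, 120]  # male freshmen
x_j = [108, 112, 101]      # female freshmen

mean_i = mean(x_i)      # corresponds to X_i bar
mean_j = mean(x_j)      # corresponds to X_j bar
x_bar = mean(x_i + x_j)  # corresponds to X bar, the mean of all the X's

# The over-all mean equals the subgroup means weighted by subgroup size.
weighted = (len(x_i) * mean_i + len(x_j) * mean_j) / (len(x_i) + len(x_j))

print("mean of group i:", mean_i)
print("mean of group j:", mean_j)
print("mean of all X's:", round(x_bar, 2), "| weighted check:", round(weighted, 2))
```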

Another commonly used convention is to designate an estimated 
figure by a letter followed by prime. According to this 
convention, if an estimate is made of the value of X (for example, 
the coming crop yield of wheat based upon reports to the United 
States Department of Agriculture), the estimate is symbolically 
designated X'. Similarly, if an estimate of X (the price of 
wheat, for example) is made from information on supply and 
demand data, it is called X'. The small Greek letter sigma (σ) 
is used to designate standard deviation.¹ A special estimate 
of the standard deviation is symbolized by a. 

It is a common practice to use certain other Greek letters to 
symbolize statistics. Accordingly, $\mu_1, \mu_2, \ldots, \mu_n$ (the Greek 
letter mu) symbolize the series of statistics called “moments” 
about a mean of a sample. The symbol $\pi$ refers to the constant 
3.1416. The symbols $\nu_1, \nu_2, \ldots, \nu_n$ (the Greek letter nu) refer 
to moments about an arbitrary origin. 
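
The distinction between moments about the mean and moments about an arbitrary origin can be illustrated as follows. The sketch assumes the common definition of the k-th moment as the average of the k-th powers of the deviations from the chosen origin (the text has not yet given a formula), takes zero as the arbitrary origin, and uses hypothetical data.

```python
def moment(values, k, origin):
    """k-th moment about `origin`: the mean of (value - origin) ** k."""
    return sum((v - origin) ** k for v in values) / len(values)


data = [2, 4, 4, 6, 9]  # hypothetical sample
mean_x = sum(data) / len(data)

# Moments about the mean (mu) and about an arbitrary origin, here zero (nu).
mu_1 = moment(data, 1, mean_x)
mu_2 = moment(data, 2, mean_x)
nu_1 = moment(data, 1, 0)
nu_2 = moment(data, 2, 0)

print("mu_1 =", mu_1, " mu_2 =", round(mu_2, 2))
print("nu_1 =", nu_1, " nu_2 =", round(nu_2, 2))
# Under this definition the two sets are connected by mu_2 = nu_2 - nu_1 ** 2.
print("nu_2 - nu_1**2 =", round(nu_2 - nu_1 ** 2, 2))
```

The first moment about the mean is always zero, which is why the mean is the natural origin for the later moments.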

While the use of symbols has become fairly well standardized 
in some respects along the conventional lines indicated, complete 
uniformity and consistent systematization are far from realized. 
Even the simple conventions above enumerated are not uni- 
versally followed. Nevertheless, the student will find it an 
advantage to have his attention directed to these trends in 
symbolic representation. 


TIME SERIES 

Conventional Use of X and T to Symbolize Passage of Time. A 
convention in time-series analysis is that X is used to refer to the 
passage of time. T is also used for this purpose.² It happens 
that the same symbol, X, is conventionally used in geometry, trig- 
onometry, and the like, to refer to the horizontal axis in a plane. 
The unification of these two conventions results in the convention 
in statistics that, in making graphs of statistical time series, the 
X-axis (the horizontal axis) is used to represent the passage of 
time. Thus, the passage of time may be indicated by a series of 
X's: $X_1, X_2, X_3, \ldots, X_n$, as shown in Fig. 33, where X refers 
to years 

    1940     1941     1942     1943     . . . 
    $X_1$    $X_2$    $X_3$    $X_4$    . . . 

Fig. 33. 

or as shown in Fig. 34, where X refers to months. 

    Jan.     Feb.     Mar.     Apr.     . . . 
    $X_1$    $X_2$    $X_3$    $X_4$    . . . 

Fig. 34. 

1 For further discussion of the standard deviation, see Chap. VI. 
2 See Chaps. XIX-XXIV. 
As indicated, $T_1, T_2, T_3, \ldots, T_n$ may also represent the passage 
of time. 

Lower-case letters x and t refer to deviation from the mean; 
that is, $X - \bar{X} = x$; $T - \bar{T} = t$. 
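
As a small worked instance of the lower-case convention, the sketch below computes the deviations x for a run of five years. The years are chosen only for illustration.

```python
# Five successive years serve as the X values (illustrative choice).
years = [1940, 1941, 1942, 1943, 1944]

mean_year = sum(years) / len(years)          # X bar
deviations = [X - mean_year for X in years]  # x = X - X bar

print("X bar:", mean_year)
print("x values:", deviations)
print("sum of x:", sum(deviations))
```

The deviations from the mean always sum to zero, so with an odd number of equally spaced periods the middle period becomes the origin.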

Where the Variable Fluctuates in Size with Time. When the 
statistician is dealing with a variable that fluctuates in size with 
the passage of time, he refers to this variable as Y. This is a 
convention; there is no logical reason for it except that he has 
already used the symbol X or T to refer to time and wants to 
have a different symbol for the variable being studied as it fluc- 
tuates through time. This situation is described in technical 
language by saying that the variable is a function of time, by 
which is meant merely that, as time passes, the variable fluctuates 
in magnitude, one way or another. The simple symbolic way of 
saying exactly the same thing (where X refers to time and Y 
refers to the variable) is 

$$Y = F(X)$$ 

There is nothing mysterious to be read into this expression. It 
is merely a use, slightly different from the ordinary one, of the 
equality sign; and the whole expression means that Y is a func- 
tion of X, or the variable which is being studied is a function of 
time, meaning that it fluctuates with the passage of time. This 
may be illustrated by one or two examples, imaginary figures 
being used. 

Time Passes in 1944 

The unit that constitutes the variable is the price of sugar per pound in 
the New York City market (average for the month of prevailing daily prices). 



X                      Y 

$X_1$   January        $Y_1$   3 cents 
$X_2$   February       $Y_2$   2 cents 
$X_3$   March          $Y_3$   4.3 cents 
$X_4$   April          $Y_4$   5 cents 
$X_5$   May            $Y_5$   4 cents 
$X_6$   June           $Y_6$   2.8 cents 

Thus $X_1$ is the first unit of time (January), and $Y_1$ is the 
measurement of the variable Y at that time according to the 
designated unit of description; in other words, $Y_1$ is the price in 
January. Similarly, $Y_2$ is the price in February ($X_2$, or the 
second unit of time), and so on. 
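
The statement that Y is a function of X can be mirrored directly in a small lookup. The sketch below uses the imaginary sugar prices from the illustration above; the function name `F` follows the symbol in the text, and the dictionary is simply one convenient way of pairing each unit of time with its measurement.

```python
# The imaginary sugar prices from the illustration, in cents per pound.
months = ["January", "February", "March", "April", "May", "June"]
prices_cents = [3.0, 2.0, 4.3, 5.0, 4.0, 2.8]

# Pair each unit of time X with its measurement Y.
series = dict(zip(months, prices_cents))


def F(x):
    """Return the value of the variable Y for the unit of time x."""
    return series[x]


print("Y_1 =", F("January"), "cents")
print("Y_2 =", F("February"), "cents")
```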





The unit of time may be the week, as where the unit that constitutes the 
variable is the amount of rainfall in inches in New York City per week. 

X              Y 

First week     0.1 inch 
Second week    4.0 inches 
Third week     0.3 inch 
Fourth week    0.7 inch 

In this illustration, $X_1$ refers to the first week, $X_2$ to the second 
week, etc., while $Y_1$ refers to the inches of rainfall in the first 
week, $Y_2$ to the inches of rainfall in the second week, etc. 

The unit of time may be the year, as where the unit that constitutes the 
variable is the net worth of a business enterprise on Jan. 1 of each year. 


X        Y 

1936     $20,001.00 
1937     $28,546.00 
1938     $21,527.00 
1939     $20,250.00 
1940     $27,430.00 
1941     $35,240.00 

It is customary in geometry, trigonometry, etc., to let the 
vertical axis represent the Y variable; fluctuations in Y are shown 
by vertical distances. The unification of this custom with 
statistical presentation results in the convention that, when a 
graph is made of a variable that is a function of time, fluctuations 
in the Y variable are shown by vertical distances while time 
change is indicated along the X-axis, or horizontally. 

Figure 35, showing comparative changes in cash farm income, 
farm-mortgage debt, and value per acre of farm real estate for 
years 1910-1942, is an illustration of the graph of a time series. 

Careful Description of Units Involved. One or two matters 
concerning the units involved in time series should be noted. 
Sometimes the variable refers to an average value over a specified 
period of time; in the first illustration above, the average price 
of sugar per pound in New York City is an average over a period 
of a month. In other instances, the variable refers to a total for 
a given period of time; in the second illustration above, the inches 
of rainfall are given by totals per week. In still other problems, 
the variable refers to a quantity at the beginning of a period of 
time or at the end of a period of time; in the third illustration, 
the net worth of a business enterprise on Jan. 1 of successive 
years was used. In Fig. 35, cash farm income is in totals for 
calendar years, each year’s total being expressed as a percentage 
of the average 1910-1914 yearly income. Farm-mortgage debt 
is in amounts as of Jan. 1 each year, expressed as a percentage 
of the average 1910-1914 annual amounts. Value per acre of 
farm real estate is in amounts as of Mar. 1 each year, expressed 
as percentages of the average 1912-1914 annual amounts. 

It is important in connection with the study of time series to 
know exactly how the variable is being used. It is of equal 
importance that exact indication of this should be given. Every 
good statistician invariably indicates either in titles of tables or 
in footnotes just what his variables mean. He should do this 
no matter how expert a statistician he is and no matter how clear, 
without such explanation, his work may seem to him. 

Fig. 35. — Cash farm income, farm-mortgage debt, and value per acre of farm real 
estate, index numbers, United States, 1910-1942. 

Cumulative and Noncumulative Data. Another important 
matter is the difference between cumulative and noncumulative 
data in time series. The fundamental distinction between 
cumulative and noncumulative data is really the difference 
between data of “condition” and data of “change.” Cumulative 
data are the data of change. It is possible to add the data 
on weekly rainfall and thus obtain data on monthly rainfall 
or yearly rainfall. Sales of a store by the week can be added to 
get sales by the month or by the year. It is possible to cumulate 
the number of births daily in order to get the total number of 
births per month or per year. Income and outgo figures are 
cumulative data. To those who have studied accounting, a 
convenient analogy is to the profit and loss statement — figures 
in the profit and loss statement in the main represent cumulative 
data. 

Noncumulative data are those describing a condition and are 
not subject to the additive treatment. The average price of 
sugar per week cannot be cumulated to obtain the average price 
of sugar per month or per year. It is necessary to resort to 
averaging. The daily figures on population cannot be added 
in order to get the monthly population figures. A balance 
of $3,000 in the bank in January and of $5,000 in March does not 
give a balance of $8,000 for the two months. These are 
items of condition and cannot be added. In order to obtain 
significant summary results in the case of noncumulative data 
over several periods of time, it is necessary to average rather 
than to add. 

The method of averaging is applicable, not only to the non- 
cumulative, but to the cumulative type of data. It is significant 
to speak of the average daily rainfall during a given month or 
year, or the average weekly rainfall during a given month or 
year, or the average weekly sales of a given year, etc. 
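
The difference in treatment can be sketched in a few lines. The weekly rainfall figures are those of the earlier illustration and the bank balances are those mentioned above; everything else is only an illustrative arrangement.

```python
# Cumulative data (data of change): weekly rainfall in inches, from the earlier illustration.
weekly_rainfall = [0.1, 4.0, 0.3, 0.7]

total_rainfall = sum(weekly_rainfall)  # adding is meaningful
average_weekly_rainfall = total_rainfall / len(weekly_rainfall)  # averaging is also meaningful

# Noncumulative data (data of condition): bank balances in dollars.
balances = {"January": 3000, "March": 5000}

# Adding the balances would give a meaningless $8,000; only the average is significant.
average_balance = sum(balances.values()) / len(balances)

print("four-week rainfall total:", round(total_rainfall, 1), "inches")
print("average weekly rainfall: ", round(average_weekly_rainfall, 3), "inches")
print("average balance:         ", average_balance, "dollars")
```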

Another way of referring to a time series is to describe it as the 
situation in which a variable is classified according to the time 
of its occurrence. The basis of classification is time; and the 
most logical arrangement of the data in question is that basis. 
As will be seen, the data of a time series may be reclassified, for 
certain purposes, upon a different basis, and when this is done 
they no longer constitute a time series. 

Charting Time Series. When a time series is graphed, the 
X-axis is used to represent passage of time, while the Y-axis is 
used to represent varying magnitudes. Thus, in Fig. 36, the 
points plotted would represent a magnitude equal to 2 in 1940, 
equal to 1 in 1941, and equal to J in 1942, rising to 1| in 1943. 
It is conventional to represent time series by lines or curves 
connecting the plotted points. In graphic phraseology these 
lines may be drawn through the plotted points as polygons 
(Fig. 36), or the changes in direction may be curved. 

Fig. 36. — Chart of a time series. 

Two kinds of charts are in general use for the graphic presen- 
tation of time series: (1) arithmetic charts and (2) ratio charts. 


ARITHMETIC AND RATIO CHARTS 

Arithmetic Charts. The arithmetic chart pictures arithmetic 
changes in magnitude. For illustration, in Fig. 37 is shown a 
variable magnitude represented by the line AA', increasing by 1 
during each time interval. This produces a straight line. On 
such a scale any variable increasing at a constant rate would give 
a straight line; but any variable increasing at a constant relative 
rate would produce an ever-steeper curve. This is illustrated by 
BB', which shows a magnitude doubling in each interval, that is, 
increasing at a constant relative rate. 
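
The contrast between the two lines can be checked numerically. In the sketch below the series `aa` rises by 1 in each interval and `bb` doubles in each interval, as described for Fig. 37; the starting values and the number of intervals are assumptions made for the example.

```python
periods = range(5)

aa = [1 + t for t in periods]   # increases by 1 each interval (starting value assumed)
bb = [2 ** t for t in periods]  # doubles each interval (starting value assumed)


def differences(series):
    """Arithmetic change from each interval to the next."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def ratios(series):
    """Relative change (ratio) from each interval to the next."""
    return [later / earlier for earlier, later in zip(series, series[1:])]


print("AA' differences:", differences(aa))  # constant, so a straight line
print("BB' differences:", differences(bb))  # growing, so an ever-steeper curve
print("BB' ratios:     ", ratios(bb))       # constant relative rate
```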

The significant comparison in such a chart is always with zero, 
and hence the zero line should invariably be included in the chart. 
Leaving out the scale between zero and the point where the 
curve reaches its lowest point will give a deceptive appearance 
to the changes that occur. This is illustrated in Fig. 38, where 
$P_2$ is really 1| larger than $P_1$ (see scale) but appears in the figure 
to be twice as great because only part of the vertical scale is 
shown. 
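
The deception produced by omitting the zero line can be measured. The sketch below compares the true ratio of two plotted values with the ratio of their heights above the bottom of a truncated scale. The values 110 and 120 and the scale beginning at 100 are hypothetical and are not read from Fig. 38.

```python
def apparent_ratio(p1, p2, axis_start):
    """Ratio of the plotted heights of p2 and p1 above the bottom of the scale."""
    return (p2 - axis_start) / (p1 - axis_start)


# Hypothetical values; the truncated scale is assumed to begin at 100.
p1, p2 = 110.0, 120.0

true_ratio = apparent_ratio(p1, p2, axis_start=0.0)
truncated_ratio = apparent_ratio(p1, p2, axis_start=100.0)

print("with zero line shown:", round(true_ratio, 3))
print("with scale cut at 100:", truncated_ratio)
```

Under these assumed figures a rise of about 9 per cent is made to look like a doubling.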

An arithmetic chart may also be a graph of relative figures, 
in which change from time to time relative to some base is 
pictured. Such a graph is Fig. 35. In this kind of graph, the 
base is usually called arbitrarily 100 per cent and the relative 
changes above and below that base are graphed as percentages 
of it. Figure 39 shows a magnitude at 105 in 1941 (5 per cent 
above the base), at 95 in 1942, at 90 in 1943, and at 105 again in 
1944. It is an extensive practice to convert time series into 



Fig. 39. — Chart of time series in relatives. 


relatives, using some particular point in time as the base; and 
when such relative series or indexes (as they are sometimes 
called) are charted, the chart assumes the form indicated in 
Fig. 39. The point of departure for reading such a chart is the 
100 per cent line, which should be emphasized — the zero point 
does not have to be shown on such a chart. The relative chart 



[Figure: line chart of monthly relatives; horizontal axis, years 1940 to 1943.] 

Fig. 40. — The percentage changes in the prices of 354 industrial stocks (1935- 
1939 as 100). (Survey of Current Business, Weekly Supplement, Apr. 29, 1943.) 

should not be confused with the case in which raw data are 
already in the percentage form and the zero per cent may be 
the significant point of departure rather than 100. Thus the 
raw data may be percentages of population paying income taxes 
in successive years. In such a case the zero line is important, 
the raw data themselves being in percentage figures. 
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Figure 40 is an illustration of a graph of a relative time series. 
It shows the month-to-month variation, compared with the 
avejage of monthly figures, 1935-1939, in the prices of 354 
industrial stocks; thus the average 1935-1939 equals 100 per cent. 
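The conversion of a time series into relatives can be sketched in a few lines of Python. The raw figures and the base value below are hypothetical; they are chosen only so that the resulting relatives agree with the readings quoted for Fig. 39 (105, 95, 90, 105).

```python
def to_relatives(values, base):
    """Express each value as a percentage of the base (base = 100 per cent)."""
    return [100.0 * v / base for v in values]

# Hypothetical raw figures (not from the text) and an assumed base-period value.
raw = {1941: 210.0, 1942: 190.0, 1943: 180.0, 1944: 210.0}
base_value = 200.0

relatives = dict(zip(raw, to_relatives(raw.values(), base_value)))

for year, rel in relatives.items():
    # The 100 per cent line is the point of departure for reading the chart.
    print(year, rel, f"({rel - 100:+.0f} per cent from base)")
```

Because every figure is divided by the same base, the shape of the curve is unchanged; only the scale is restated so that 100 marks the base period.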

Ratio Charts. The second type of chart for graphing time
series is the ratio chart, which is designed to picture relative rate
of change. According to Wesley C. Mitchell, the idea of the
ratio chart was introduced by Jevons in 1863-1865.¹ But the
ratio chart did not come into general use until its advantages
were explained by Prof. Irving Fisher and James A. Field, in
1917.²



The great popularity in recent years of the ratio chart has 
been largely due to the fact that special graphing paper has 
been made for the purpose, the work of making such a chart being 
thus vastly simplified. 

1 Mitchell, W. C., Business Cycles, p. 209.

2 Fisher, Irving, "The 'Ratio' Chart for Plotting Statistics,"
Publications of the American Statistical Association, Vol. 15
(June, 1917), pp. 577-601; Field, James A., "Some Advantages of
the Logarithmic Scale," Journal of Political Economy, Vol. 25
(October, 1917), pp. 805-841. Cited from Mitchell, op. cit., p. 209.

In the case of the arithmetic chart, equal rises on the chart
per unit of time represent a constant rate of increase (equal
absolute amounts); in the case of the ratio chart, equal rises per
unit of time represent a constant relative rate of increase. This
is illustrated by the comparison of the left with the right scale
in Fig. 41. This figure is a simple illustration showing a
magnitude changing at the same relative rate, BB', and a magnitude
changing at a constant rate, AA', both plotted on a ratio scale.
The BB' magnitude doubles in each time period. The AA' magnitude
increases in each time period by the constant difference of 4.

Notice the scale of logarithms at the right, which corresponds
to the scale of natural numbers at the left. These logarithms
are to the base 2. Thus, log₂ of 64 is 6 because 2⁶ = 64;
log₂ of 32 is 5 because 2⁵ = 32; etc. It is evident, of course, that
while the scale at the left is in geometric progression the scale
at the right is in arithmetical progression. This is a
characteristic of ratio paper. Ratio charts have no zero line, and
there is no point of emphasis. The attention is directed to the
shape and fluctuations in the curve. In the case of the
arithmetically ruled chart, growth at a constant difference is a
straight line; the greater the difference, the steeper the line,
but it is still a straight line if the difference is constant. In
the case of the ratio chart, growth at a constant relative rate is
a straight line; the greater the constant relative rate, the
steeper the line, but it is still straight.
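The contrast between the two kinds of straight line can be checked numerically. The sketch below assumes starting values of 4 for the AA' magnitude and 2 for the BB' magnitude; the starting values are illustrative assumptions, and only the constant difference of 4 and the doubling are taken from the text.

```python
import math

periods = range(6)
aa = [4 + 4 * t for t in periods]      # AA': constant difference of 4 (start assumed)
bb = [2 ** (t + 1) for t in periods]   # BB': doubles each period (start assumed)

def steps(series):
    """Rise per period when the series is plotted on an arithmetic scale."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

def log2_steps(series):
    """Rise per period when the series is plotted on a ratio scale (base 2)."""
    return steps([math.log2(v) for v in series])

print(steps(aa))        # equal rises: a straight line on arithmetic paper
print(steps(bb))        # growing rises: a curve on arithmetic paper
print(log2_steps(bb))   # equal rises: a straight line on ratio paper
print([round(s, 3) for s in log2_steps(aa)])  # shrinking rises: a curve on ratio paper
```

Equal rises per period are what make a line straight, so the same series is straight on one kind of paper and curved on the other.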

On arithmetical paper, changes in differences produce curves 
or irregular lines. On ratio paper, changes in relative rates of 
change produce curves or irregular lines. The vertical scale of 
the arithmetical chart is an arithmetic progression. The vertical 
scale of the ratio chart is in geometric progression; but the 
logarithms of the natural scale on a ratio chart are in arithmetical 
progression. For this reason, the ratio chart is often called the 
semilogarithmic chart. One method of plotting a ratio chart is to 
find the logarithms of the raw data and then plot the logarithms 
on arithmetically ruled paper. The results are the same as if 
the natural data were plotted on a ratio scale. The labor of 
looking up logarithms is avoided by having the scale made into 
a logarithmic one, upon which the plotting of natural data will 
produce the same effect as if the logarithms were found and 
plotted. This is shown in a very simple case in Fig. 41, in 
which the scale in logarithms is at the right and the scale in raw 
data units is at the left. As already explained above, the line
BB' represents a variable that increases at a constant relative
rate, while the line AA' represents a variable that increases by a
constant quantity. In Fig. 41 the vertical distance between each
of the scale markings on the left represents just double the
absolute amount of the same vertical distance immediately
below it and just half the absolute amount of the same vertical
distance immediately above it. In this figure the variable that
doubles every year follows the straight line BB' (it was a curve
in Fig. 37). A variable that increases by the same aggregate
amount each year and hence follows a straight line in Fig. 37
would follow a curved path on a ratio chart, such as line AA' of
Fig. 41.

Since the logarithm of the ratio between two quantities is equal 
to the difference between their logarithms, ratio paper can be 
easily calibrated by the use of a logarithmic scale. Thus, if 
equal vertical distances are taken to measure equal aggregate
differences between logarithms, then these same vertical distances
will represent equal relative distances (equal ratios) between the
antilogarithms of the logarithmic scale. In Fig. 41, for example,
the unit vertical distance is taken to be a unit difference between
logarithms to the base 2, and the logarithmic scale on the right
reads 0, 1, 2, 3, 4, etc. Since the antilogarithm of a number to
the base 2 is equal to 2 raised to the power of that number, the
antilogarithms of the logarithmic scale become 1, 2, 4, 8, 16, etc.
This is the
scale shown on the left. It is evident that while the scale on 
the right is in arithmetic progression the scale on the left is in 
geometric progression. Accordingly, if paper is ruled so as to 
be in arithmetic progression with respect to some logarithmic 
scale but is marked or calibrated in terms of the antilogarithms 
of the logarithmic scale, any variable plotted on this paper in 
accordance with the antilogarithmic scaling will indicate a con- 
stant rate of growth or decline wherever it traces out a straight 
line. 
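The calibration just described rests on one identity: the logarithm of a ratio equals the difference of the logarithms. A minimal Python sketch, assuming paper ruled to the base 2 as in Fig. 41 (the function names are illustrative only):

```python
import math

# Markings of the logarithmic scale (arithmetic progression) ...
log_scale = [0, 1, 2, 3, 4]
# ... and the antilogarithms printed beside them (geometric progression).
antilog_scale = [2 ** k for k in log_scale]
print(antilog_scale)

def height(value):
    """Vertical position of a value on ratio paper ruled to the base 2."""
    return math.log2(value)

def vertical_distance(a, b):
    """Distance on the paper between the plotted points for a and b."""
    return height(b) - height(a)

# Pairs standing in the same ratio (here 2 to 1) are the same distance apart,
# wherever they fall on the scale.
for a, b in [(3, 6), (5, 10), (40, 80)]:
    print(a, b, round(vertical_distance(a, b), 6))
```

A variable that grows by the same ratio in every period therefore climbs the paper by the same distance in every period, which is why it traces a straight line.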

Most ratio paper is ruled in accordance with a logarithmic 
scale to the base 10, since this is the base of common logarithms. 
An example of this kind of "semilogarithmic" paper (as it is
often called because the vertical scale is logarithmic while the
horizontal scale is arithmetic) is shown in Fig. 42. The reason
common logarithms are to the base 10 is that numbers are
arranged upon a decimal system and, by taking the base 10 for
logarithms, the integral part of the logarithm (characteristic) is a
mere record of the position of the decimal point in the original
number. The number 10 raised to the zero power is 1, and so the
logarithm of 1 is zero; the number 10 raised to the second power
is 100, and so the logarithm of 100 is 2; the number 10 raised to
the third power is 1,000, and so the logarithm of 1,000 is 3; and
so on, indefinitely. Likewise any number between 1 and 10 will
have a logarithm (to the base 10) whose characteristic is 0; any
number between 10 and 100 will have a logarithm whose
characteristic is 1; etc. The fractional part of a logarithm (its
mantissa) is the same for all similar successions of similar digits.
The fractional part of the logarithm to the base 10 for the number
2 is the same as the fractional part of the logarithm for 20 or
200 or 2,000, etc., namely, 0.3010; but the characteristic of the
logarithm of 2 is 0, the characteristic of the logarithm of 20 is 1,
the characteristic of the logarithm of 200 is 2, and so on. Thus,
the entire logarithm of 2 is 0.3010; the entire logarithm of 20 is
1.3010; the entire logarithm of 200 is 2.3010; etc. Hence, when
the base of the logarithm is 10, logarithmic markings of -2, -1,
0, 1, 2, 3, etc., represent antilogarithmic markings of 0.01, 0.1,
1, 10, 100, 1,000, etc.
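The split of a common logarithm into characteristic and mantissa can be reproduced directly. The helper below is an illustrative sketch (its name is not from the text); it rounds the mantissa to four places, as in a four-place table.

```python
import math

def characteristic_and_mantissa(number):
    """Split log10(number) into its integral part and its fractional part."""
    log = math.log10(number)
    characteristic = math.floor(log)   # records the position of the decimal point
    mantissa = log - characteristic    # depends only on the succession of digits
    return characteristic, round(mantissa, 4)

for n in (0.02, 0.2, 2, 20, 200, 2000):
    print(n, characteristic_and_mantissa(n))
```

Every number in the loop has the same mantissa, 0.3010 to four places; only the characteristic changes as the decimal point moves.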

Semilogarithmic paper to the base 10 is usually ruled to
represent either one logarithmic unit and the fractional parts
thereof, corresponding to equal tenths on the antilogarithmic
scale (called "one-cycle paper"), or two logarithmic units and
the fractional parts of each, corresponding to equal tenths on
the antilogarithmic scale (called "two-cycle paper"), or three
logarithmic units and the fractional parts of each, corresponding
to equal tenths on the antilogarithmic scale (called "three-cycle
paper"). All three of these types of logarithmic rulings are
shown in the right part of Fig. 42. Since the logarithmic scale
is in arithmetic progression, these rulings would be the same for
any logarithm differing by one, two, or three units; they would
apply to logarithms running from -2 to 0, as well as from 0 to 2.
Thus the corresponding antilogarithmic scale can be selected
by the statistician in accordance with his needs. If his data run
from 2 to 800, for example, he would select three-cycle
semilogarithmic paper and make his scale as indicated on the left
of Fig. 42. If his data ran from 200 to 80,000, he would also
select three-cycle semilogarithmic paper and make his scale from
100 (at the bottom) to 100,000 (at the top). If his data ran from
0.2 to 8, he would choose two-cycle semilogarithmic paper and
make his scale from 0.1 (at the bottom) to 10 (at the top).
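The choice of paper in these examples follows a simple rule: take the power of 10 at or below the smallest figure as the bottom of the scale and the power of 10 at or above the largest figure as the top. A sketch of that rule, with a hypothetical function name:

```python
import math

def ratio_scale(low, high):
    """Return (number of cycles, bottom of scale, top of scale) for data
    running from low to high on semilogarithmic paper to the base 10."""
    bottom_exp = math.floor(math.log10(low))
    top_exp = math.ceil(math.log10(high))
    return top_exp - bottom_exp, 10 ** bottom_exp, 10 ** top_exp

# The three ranges discussed in the paragraph above.
for low, high in [(2, 800), (200, 80_000), (0.2, 8)]:
    cycles, bottom, top = ratio_scale(low, high)
    print(f"data {low} to {high}: {cycles}-cycle paper, scale {bottom} to {top}")
```

The rule counts whole cycles only, so a series running from 60 to 3,000 needs three cycles (10 to 10,000) even though it spans less than two full powers of 10.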

Figure 42 is an illustration of a three-cycle ratio scale for the
plotting of a time series by months for 6 years. The scale as
drawn reads from 1 to 1,000, but it could be made to read from
10 to 10,000, or from 100 to 100,000, etc. At the right of the
figure are shown the three most generally used types of ratio
scales: the three-cycle ratio scale, the two-cycle ratio scale, and
the one-cycle ratio scale. If the extreme fluctuations of a time
series are 60 and 3,000, it would be necessary to use three-cycle
paper; on the other hand, if the extreme fluctuations are 60 to
600, it would be necessary to use only two-cycle paper.

Fig. 42. Three-cycle semilogarithmic paper.

Figures 43 and 44 are intended to illustrate the advantages and
disadvantages of the ratio chart. Figure 43 shows the comparative
growth of some famous cities of the United States on an
arithmetic scale, and Fig. 44 shows the same data plotted on a
ratio scale. These data are also shown in Table 7. It will be
noticed that on an arithmetic scale it is not possible to bring the
New York City population growth curve into the picture. On 
the ratio paper this is possible. Of course, on the arithmetically 
ruled paper New York City population could be plotted on a 
different scale; but then the arithmetic comparison between 
New York City and the other cities would be lost, since the 
height of the curve from the zero line is what counts in the com- 
parison on arithmetic paper. 



Fig. 43. — Growth of certain cities in the United States (arithmetic scale). 

The advantage of the ratio chart is threefold. (1) It makes
possible a quick answer to the question as to whether a magnitude
is changing its rate of growth. (2) It clearly pictures the
relative significance of fluctuations; for example, arithmetic
differences of small magnitudes appear as important as the same
relative differences of large magnitudes. On an arithmetic chart
the latter would appear much larger. If an arithmetic chart of
almost any item of production in the United States, say from
1800 to 1940 by years, is constructed, the fluctuations in the
curve for the earlier period will be minute, while the fluctuations
in the curve for the latter part will loom very large. In such
cases, the conclusion has therefore sometimes been reached that
instability is greater now than formerly. Plotting the same data
on ratio paper would in most cases show that the earlier
fluctuations were relatively as great as or greater than the modern
ones. (3) It facilitates comparisons between time series in order
to detect correlation between them.
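The second advantage can be illustrated with two hypothetical fluctuations, one early in a series when the level is low and one late when the level is high. The figures are invented for the illustration and do not come from any production series.

```python
import math

# Hypothetical swings: an early rise from 10 to 12, a late rise from 1,000 to 1,100.
early = (10.0, 12.0)
late = (1000.0, 1100.0)

def arithmetic_rise(pair):
    """Vertical rise on an arithmetic chart: the plain difference."""
    return pair[1] - pair[0]

def ratio_rise(pair):
    """Vertical rise on a ratio chart: the difference of common logarithms."""
    return math.log10(pair[1] / pair[0])

for label, pair in (("early", early), ("late", late)):
    print(label, arithmetic_rise(pair), round(ratio_rise(pair), 4))
```

On the arithmetic chart the late swing is fifty times the height of the early one; on the ratio chart the early swing (a 20 per cent rise) stands taller than the late one (a 10 per cent rise).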



Fig. 44. Growth of certain cities in the United States (logarithmic,
or ratio, scale).

The disadvantage of the ratio chart is that it is not possible 
to make magnitude comparisons. For illustration, if the 
attempt were made to compare the actual size of Trenton, N.J., 
and New York City in 1930, an entirely incorrect impression 
would be created — Trenton would appear from the ratio chart 
to be about half as large as New York City in 1940 if vertical 
distance were assumed to be magnitude. When the ratio chart 
is used, such magnitude comparisons must be made by the use 
of the raw figures themselves, which should always be given in a 
table of figures along with the chart.

Table 7. Population of Specified Cities in the United States from
Earliest Census to 1940
(In thousands)

Census     Trenton,     Portsmouth,     Omaha,     New York
date       N. J.        N. H.           Neb.       City*

1790         ...           4.7            ...          49.4
1800         ...           5.3            ...          79.2
1810         ...           6.9            ...         119.7
1820         3.9           7.3            ...         152.1
1830         3.9           8.0            ...         242.3
1840         ...           7.9            ...         391.1
1850         6.5           9.7            ...         696.1
1860        17.2           9.3            1.9       1,174.8
1870        22.9           9.2           16.1       1,478.1
1880         ...           9.7           30.5       1,911.7
1890        57.5           9.8          140.5       2,507.4
1900        73.3           ...          102.6       3,437.2
1910        96.8          11.3          124.1       4,766.9
1920       119.3          13.6          191.6       5,620.0
1930       123.4          14.5          214.0       6,930.4
1940       124.7          14.8          223.8       7,455.0

Source: Sixteenth Census of the United States, 1940, Vol. I, Population, pp. 32 and 660.
* Refers to New York City and its boroughs as constituted in 1940.
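The warning against reading magnitudes from a ratio chart can be checked against the 1940 figures of Table 7. The sketch assumes that the vertical scale of the ratio chart begins at 1 (thousand), so that the height of a point above the bottom line is its common logarithm; a chart with a different bottom line would give a different apparent proportion.

```python
import math

# 1940 populations in thousands, from Table 7.
trenton = 124.7
new_york = 7455.0

# Assumption: the ratio scale starts at 1 (thousand), so height = log10(value).
height_trenton = math.log10(trenton)
height_new_york = math.log10(new_york)

apparent_share = height_trenton / height_new_york  # what vertical distance suggests
actual_share = trenton / new_york                  # what the raw figures show

print(f"apparent share of New York City's size: {apparent_share:.2f}")
print(f"actual share of New York City's size:   {actual_share:.3f}")
```

Under that assumption the heights stand at a little more than one-half, while the raw figures put Trenton at less than one-fiftieth of New York City, which is why the table should accompany the chart.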


FREQUENCY SERIES 

Definition of a Frequency Series. A convenient arrangement
of any set of data is a classification according to magnitude,
that is, from smallest to largest. In the case of a time series,
time seems to be the most logical and workable basis of classifica- 
tion, because it seems reasonable to view things as they occur in 
time. There is a rationality about such a procedure. But 
another aspect of data, unrelated to time, may be important. 
For example, how many different prices of sugar during a given 
week differed from the average price for that week, and in what 
respect did they differ, or from how wide a range of fluctuations 
in price during the week were the respective average weekly 
prices calculated? This particular aspect would have no
reference to time, except as a matter of definition of the unit
involved (one would not take prices of the third week in March
to study the average price in the first week of March). When
the arrangement of data according to time of occurrence is not
significant, it seems rational to classify the data in a series from
smallest to largest. When this is done, the resulting series of
data is called an "array."

Following is an example of an array:¹

An Array of 10 Children in Third Grade, by Age

             Age,
             Years

X₁           [illegible]
X₂           [illegible]
X₃           [illegible]
X₄           8
X₅           8
X₆           [illegible]
X₇           [illegible]
X₈           9
X₉           [illegible]
X₁₀          [illegible]

Variable X arranged according to magnitude, where X = age
of children in third grade, X₁ = age of youngest child, etc.,
until X₁₀ is age of oldest child.

The situation may be one where there are a number of children 
of each age, for example: 

An Array of 18 Children in Third Grade, by Age

             Age,
             Years

X₁           6¼
X₂           6½
X₃           6½
X₄           6¾
X₅           6¾
X₆           6¾
X₇           7
X₈           7
X₉           7
X₁₀          7
X₁₁          7¼
X₁₂          7¼
X₁₃          7¼
X₁₄          7½
X₁₅          7½
X₁₆          7½
X₁₇          7¾
X₁₈          8


1 The particular magnitude here taken is "age." Any other common
characteristic could be taken as the magnitude for comparison of
the children, for example, height or weight. The basis must be a
common characteristic or attribute that is a variable magnitude
capable of quantitative measurement.

From the array, it is noticed that there is 1 child 6¼ years old,
and there are 2 children 6½ years old, 3 children 6¾ years old, etc.

Inasmuch as there are really only eight variations of the
variable X, some of which occur more than once, the above is
more conveniently summarized as follows:

Number of Children of Specified Age, among 18 Children in Third
Grade

          Age, Years, of Children     Number of Children
             in Third Grade            of Specified Age
                   X                          F

X₁                6¼                          1            F₁
X₂                6½                          2            F₂
X₃                6¾                          3            F₃
X₄                7                           4            F₄
X₅                7¼                          3            F₅
X₆                7½                          3            F₆
X₇                7¾                          1            F₇
X₈                8                           1            F₈

                                             18 = N

This is called a "frequency series" or a "frequency distribution";
the variable is listed in a column in the form of an array,
and in a second column the frequencies of each variation are
set down. It is merely a condensed form of the array and is
particularly convenient, as may be readily imagined, when a
large number of cases is studied. It will be noticed that a new
symbol is introduced, but it is a very simple one and one that
readily suggests itself. F₁ refers to the number of times X₁
occurs, F₂ the number of times X₂ occurs, etc. F stands in
general for the frequency of occurrence of a variation; 18 is the
total number of cases and is therefore the sum of the F's, and
this is written ΣF. (F₁ + F₂ + F₃ + · · · + Fₙ = ΣF.) However,
a more general way to symbolize the total number of
cases is to use a large N. Either ΣF or N could be used, but
it is conventional in statistics to use N to represent ΣF. This Σ
is the capital Greek letter sigma, and it is always used in statistics
to designate "sum" or "total of."

Nature of a Frequency Distribution and Illustration. The idea
of the array and of the frequency distribution in its barest
simplicity has been illustrated. From the example, it is seen
that the frequency distribution is merely the commonplace and
rational arrangement of a set of data in order of magnitude.
As indicated elsewhere, this form of arrangement discloses a
natural order that appears to persist in all things,¹ namely,
that in a large number of observations of a common characteristic
of a thing the following tendencies exist:

1. A large number of frequencies cluster about a central 
magnitude or average, which occurs most frequently. 

2. Small variations above and below this central magnitude 
are numerous. 

3. Large variations are much less frequent. 

4. Extreme variations are rare. 
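The condensation of an array into a frequency series, and the clustering about a central magnitude noted in the list above, can be sketched with the 18 third-grade ages. The ages are written as decimals (6.25 for 6¼ years, and so on); the variable names are illustrative.

```python
from collections import Counter

# The array: ages in years of the 18 children, from smallest to largest.
array = ([6.25] + [6.5] * 2 + [6.75] * 3 + [7.0] * 4
         + [7.25] * 3 + [7.5] * 3 + [7.75] + [8.0])

# The frequency series: each distinct value of X with its frequency F.
frequency = Counter(array)
N = sum(frequency.values())   # N stands for the sum of the F's

for x in sorted(frequency):
    print(f"X = {x:<5}  F = {frequency[x]}")
print("N =", N)

# The central magnitude is the one that occurs most frequently.
most_frequent = max(frequency, key=frequency.get)
print("most frequent age:", most_frequent)
```

The eighteen entries of the array reduce to eight rows, with the largest frequency at age 7 and the frequencies falling away toward both extremes.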

Following is an example of a frequency distribution showing 
the number of cities of 100,000 or more population that have 
specified death rates from puerperal causes:

Table 8. Maternal Mortality in Cities of 100,000 or More
Population in the United States, 1938

    Death Rates
 (Number per 1,000          Number of
    Live Births)              Cities
         X                      F

         1-                     2
         2-                    16
         3-                    18
         4-                    20
         5-                    15
         6-                    10
         7-                     4
         8-                     6
         9-                     0
        10-                     2

                               93

Source: Bureau of the Census, "Vital Statistics," Special Reports, Vol. 9, No. 7 (Feb. 10,
1940), pp. 125-126.


1 See Chaps. VI and VII.

The average maternity death rate for these 93 cities is 4.8 per
1,000 live births. It will be noted that, instead of writing X₁,
X₂, X₃, . . . , for each variant of the variable X, the symbol X
is written at the head of the column, indicating that the column
consists of X₁, X₂, X₃, . . . , Xₙ. The symbol F is handled in a
similar manner. Furthermore, in this illustration, class intervals
of 1 are used, which is signified by the dash after each of the
numbers in the X column. This is because fractional rates are
given in the source and not merely rounded numbers.¹ For
example, the death rate from puerperal causes in 1938 in the
city of Akron, Ohio, was 4.0; in the city of Albany, N.Y., it was
3.1; in the city of Atlanta, Ga., it was 4.4. Since the death
rates are given to one decimal place, if class intervals were not
used for the frequency table, it would require some hundred or
more rows of figures to place the death rates in an array. The
symbol for the class interval is i. In this case, i = 10 decimal
units, or 1. The average 4.8 was calculated by assuming that
cases in any class interval all had the value of the mid-point
of the interval.
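The mid-point calculation behind the average of 4.8 can be reproduced from the frequencies of Table 8. This is a sketch of the assumption stated above (every case in a class is treated as lying at the mid-point of its interval); the variable names are illustrative.

```python
# Lower limits of the classes (1-, 2-, ..., 10-) and their frequencies, from Table 8.
lower_limits = list(range(1, 11))
frequencies = [2, 16, 18, 20, 15, 10, 4, 6, 0, 2]
i = 1  # the class interval

midpoints = [lower + i / 2 for lower in lower_limits]

N = sum(frequencies)
average = sum(m * f for m, f in zip(midpoints, frequencies)) / N

print("N =", N)
print("average death rate =", round(average, 1))
```

The sum of mid-point times frequency is 445.5, which divided by the 93 cities gives about 4.79, or 4.8 to one decimal place.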

1 For a more complete discussion of the class interval and calculation of
averages, see Chap. VII.

Discrete and Continuous Frequency Series. A discrete
frequency series is one in which the units of measurements are
more or less fixed by the character of the data. The phenomena
actually occur in such a manner that their variations in size
proceed by distinct jumps or steps. The unit of measurement is
fixed by this fact. An example of such a series is a frequency
distribution of interest rates, in which the quoted variations
in rates are likely to fluctuate by jumps of certain fixed fractions
of 1 per cent and there are few if any intermediate variations.
The variation in the range of the actual cases is consequently by
distinct steps of those fractions of 1 per cent. The variation
throughout the range is not by infinitesimal amounts. The very
character of the data determines the unit of measurement and its
degree of refinement. Where variation proceeds in this manner, by
discrete steps of considerable magnitude as compared with the
whole range of variation, it is probably best not to use a class
interval. If the number of different values of X that occur are
too numerous for convenience, however, then the data may be
grouped into class intervals. Great care should be employed in
this case to see that the class intervals are chosen so that the
possible values of X are placed in a balanced position throughout
the intervals. For example, if values of X occur at 0, 2, 4, 6, 8,
etc., then, if grouping is desired, a class interval of size 4
might be chosen running from 1 up to but not including 5, from 5
up to but not including 9, etc. These would balance the actual X
values around the center of each interval. On the other hand,
intervals of 4, running from 0 up to but not including 4, from 4
up to but not including 8, etc., would result in the actual X
values occurring at the lower limit and middle of each interval,
causing an upward bias if the cases are assumed to be concentrated
at the mid-points of the intervals, as is usual.¹ If the discrete
data vary by steps that are small in relation to the range of
variation in the data (e.g., in steps of 1 cent over a range of
$100), then the data might reasonably be treated as if they were
continuous.
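The upward bias described above can be demonstrated with a small hypothetical set of discrete values, one case at each even number from 2 to 16. The data and the function name are assumptions made for the illustration.

```python
def grouped_mean(values, start, width):
    """Mean computed as if every case lay at the mid-point of its class.
    Classes run from `start` in steps of `width`, lower limit included."""
    counts = {}
    for v in values:
        lower = start + width * ((v - start) // width)
        counts[lower] = counts.get(lower, 0) + 1
    total = sum(counts.values())
    return sum((lower + width / 2) * f for lower, f in counts.items()) / total

# Hypothetical discrete data: values occur only at even numbers.
values = list(range(2, 18, 2))          # 2, 4, ..., 16
true_mean = sum(values) / len(values)

balanced = grouped_mean(values, start=1, width=4)    # classes 1-, 5-, 9-, 13-
unbalanced = grouped_mean(values, start=0, width=4)  # classes 0-, 4-, 8-, 12-, 16-

print(true_mean, balanced, unbalanced)
```

With classes starting at 1 the possible values sit symmetrically about each mid-point and the grouped mean equals the true mean; with classes starting at 0 the values sit at the lower limit and the middle of each class, and the grouped mean comes out too high.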

A continuous series is one representing a phenomenon that 
varies by infinitesimal amounts. It may have the appearance 
on the statistical table of the same discreteness as the discrete 
series; but this is because the arbitrarily discrete character of 
the unit of measurement eclipses the actual continuous character 
of the data. In a continuous series the range of the interval 
is obtained by a process of testing and finding the one that 
appears best to smooth the data, following the general rules for 
determining the class interval discussed later.² Frequency
series of all growth phenomena are of the continuous type.
For example, the frequency distributions of weights or heights
of people of some specified age are continuous in character.
In passing from one height to another, the individual must 
necessarily pass through every minute difference between; and 
accordingly in measuring the heights of individuals at the same 
age (or of mature people) the variants will be by minute or 
infinitesimal differences. The units of measurement, however, 
will make them appear discrete in character. 

1 Cf. Chap. VII.

2 See Chap. VII.

Charts of Frequency Distributions. A frequency table is the
presentation of a series of variable magnitudes, usually arranged
from smallest to largest, in such a manner as to record the
frequencies of the different magnitudes. For purposes of graphing
it is conventional to use the x-axis for the variable magnitude
and the y-axis for the frequencies. For illustration, in Fig. 45,
the x-axis shows the variations of magnitude (death rates from
puerperal causes in 1938) and the y-axis the frequencies (the
number of cities of 100,000 or more population) of those death
rates, so that the points appearing from the left to the right
signify the following:

Death rates in 1938 from puerperal causes:

2 cities have death rates between 1 and 2 per 1,000 live births.

16 cities have death rates between 2 and 3 per 1,000 live births.

. . .

6 cities have death rates between 8 and 9 per 1,000 live births.

0 cities have death rates between 9 and 10 per 1,000 live births.

2 cities have death rates between 10 and 11 per 1,000 live births.

The points are plotted over the mid-points to indicate that
the frequencies cover the class interval and not merely the
rounded quantities shown on the scale. Accordingly, F₁, or 2,
is plotted directly over 1.5; F₄, or 20, is plotted directly over 4.5;
etc. It is easily seen from the figure that the peak of the
frequencies is in the interval containing the average. It can also
be seen that numerous small variations from the average occur,
but large variations from the average are few in number; that is,
the frequency polygon slopes rapidly downward on each side
of the average where the frequency is highest. Variations of
1 below average death rate (death rate of about 3.8) lie in the
class interval having 18 cases; variations of 1 above average
death rate (death rate of about 5.8) lie in the class interval having
15 cases. Variations of 3 below and above average are much
less frequent; only 2 cases are in the class interval containing
death rate 1.8, and only 4 cases are in the class interval containing
death rate 7.8.

Fig. 45.
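The plotted points of the frequency polygon, and the frequencies quoted for variations of 1 and 3 from the average, can be recovered from the Table 8 figures. The sketch below is illustrative; the function name is an assumption.

```python
# Classes 1-, 2-, ..., 10- with their frequencies (Table 8); class interval i = 1.
lower_limits = list(range(1, 11))
frequencies = [2, 16, 18, 20, 15, 10, 4, 6, 0, 2]
i = 1

# Each point of the polygon sits over the mid-point of its class.
polygon_points = [(lower + i / 2, f) for lower, f in zip(lower_limits, frequencies)]
print(polygon_points[:4])

def frequency_of_class_containing(rate):
    """Number of cities in the class interval that contains the given death rate."""
    for lower, f in zip(lower_limits, frequencies):
        if lower <= rate < lower + i:
            return f
    return 0

average = 4.8
for deviation in (-3, -1, 0, 1, 3):
    rate = round(average + deviation, 1)
    print(rate, frequency_of_class_containing(rate))
```

The frequencies fall from 20 in the class containing the average to 18 and 15 one unit away, and to 2 and 4 three units away.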

Instead of a polygon to trace the direction of frequencies,
the practice of using bars to depict frequency distributions is
often followed. Figures 46 to 48 are illustrations of such graphs
of frequency distributions. It is possible also to fit a curve to
the points either by freehand or by mathematical means and
thus describe graphically the frequency distribution by a curve,
which is called a "frequency curve."¹

1 See Chap. VI.

In Figs. 46 to 48 it is interesting to compare the concentration
of percentage changes in the three different periods, namely,
1926-1929, when prices and economic activity were comparatively
stable; 1929-1932, when prices and economic activity
were on the decline; and 1932-1937, when prices and economic
activity were increasing. Figure 46, depicting the distribution
of percentage price changes, 1926-1929, is quite symmetrical,
and the slope on each side of the maximum frequency is rapid;
the position of the mean (wholesale price index for all
commodities, or -4.7) is close to midway between the two extreme
ranges of the variable. In Figure 47, however, there is no such
symmetry. On the contrary, there is a piling up of cases in the
negative direction so that the slope to the left of the maximum
frequency is gradual while the slope to the right is parabolic; the
distribution appears to have a tail in the negative direction.
Figure 48, on the other hand, shows the opposite tendencies,
with the appearance of a tail extending in the positive direction.

Fig. 46. Distribution of 617 wholesale price items by percentage of
price change, 1926-1929. Horizontal axis: per cent change in price.
(National Resources Committee, The Structure of the American
Economy, Part I, pp. 128 and 131.)

Fig. 47. Distribution of 617 wholesale price items by percentage of
price change, 1929-1932. Horizontal axis: per cent increase in price.
(National Resources Committee, The Structure of the American
Economy, Part I, pp. 128 and 131.)

Fig. 48. Distribution of 617 wholesale price items by percentage of
price change, 1932-1937. Horizontal axis: per cent change in price.
(National Resources Committee, The Structure of the American
Economy, Part I, pp. 128 and 131.)

Figures 49 and 50 illustrate the use of frequency curves in
chemical studies.

Figures 51 and 52 are illustrations of the use of frequency
histograms in biochemical studies.

While the frequency distribution in Fig. 45 is in the form of a
polygon, those of Figs. 46 to 48 and 51 and 52 are shown by
outline bars. When a frequency distribution is drawn with bars,
the graph is called a "histogram."

Figs. 49 and 50. Analysis of the semicarbazone, melting point 178-179°,
proved it to be derived from an aldehyde C16H22O. The position of its absorption
maximum at 3260 A. and that of the free aldehyde (3150 A.) regenerated on
hydrolysis with phthalic anhydride are in excellent agreement with the positions
found for citrylidenecrotonaldehydes and their semicarbazones. [Barraclough,
E., J. W. Batty, I. M. Heilbron, and W. E. Jones, "Studies in the Polyene Series,
Part I," Journal of the Chemical Society (London), October, 1939, p. 1551.]


Frequency Distribution Plotted on a Ratio Scale. At an earlier
point in this chapter (page 131) the effect of plotting a time series
on a ratio scale (semilogarithmic paper) was discussed. For
some purposes the use of similar paper for the plotting of a
frequency series is desirable. Figure 53 shows the effect of
plotting on a ratio scale the frequency distribution showing the
number of cities having specified death rates from puerperal
causes. The frequency distribution when plotted on the
arithmetic scale as shown in Fig. 45 appears to be unsymmetrical,
with a steep slope on the left side and a gradual slope
on the right side. The use of a ratio scale for the variable
magnitude (continuing the use of an arithmetic scale for the
frequencies), as illustrated in Fig. 53, has reduced this contrast to
such an extent that the slopes on either side are almost the same
and the frequency polygon appears to be almost symmetrical.

Figs. 51 and 52. Showing distribution of daily Ca excretion for groups of rats.
Figure 51 shows results of 792 determinations of urinary Ca (mg./100 g./24 hr.)
under standard conditions. Figure 52 shows differences (283 values) between
test and arbitrarily selected control groups. In both cases the results correspond
with a normal distribution. [Truszkowski, R., J. Blauth-Opieńska, and J.
Iwanowska, "Parathyroid Hormone," The Biochemical Journal (London), Vol. 33
(1939), p. 1007.]

Fig. 53. Death rates in 1938 from puerperal causes. (Cf. Fig. 45.)

An interesting application of logarithmic frequency-distribution
analysis has recently been made in entomology,¹ by
C. B. Williams, who says:

Mr. Yule shows that the frequency distribution of sentence length
(i.e., number of words between successive full stops) is of the skew type
and by comparing two different manuscripts ... he is able to produce
convincing mathematical evidence on the identity or otherwise of their
authorship. . . . When I converted some of Yule's tables into diagrams
I was struck by their general resemblance to skew distributions with 
which I have recently been dealing in some entomological problems, 
. . . which distributions, I found, became normal and symmetrical if 
the logarithm of the number was taken as a basis for subdivision into 
groups instead of the number itself. 

Taking the logarithm of the number as a basis for subdivision 
into groups instead of the number itself accomplished the same 
end as the plotting of the original groups on a logarithmic or 
ratio scale. 
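
The effect Williams describes can be sketched in a few lines of Python. The data below are simulated (an assumption made for illustration; they are not Williams's or Yule's figures), and the moment-based measure of skewness is a conventional choice that has not yet been introduced at this point in the text.

```python
import math
import random


def skewness(values):
    """Moment-based skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical positively skewed magnitudes (lognormal by construction).
raw = [random.lognormvariate(3.0, 0.8) for _ in range(5000)]

# Group on the logarithm of the number instead of the number itself.
logged = [math.log10(v) for v in raw]

skew_raw = skewness(raw)        # clearly positive: long tail on the right
skew_logged = skewness(logged)  # close to zero: nearly symmetrical
print(round(skew_raw, 2), round(skew_logged, 2))
```

Grouping the logged values into equal class intervals corresponds to plotting the original magnitudes on a ratio scale, which is why the two procedures lead to the same picture.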


GROWTH CURVES 

Not all curves shaped like frequency polygons or curves are 
in truth graphs of frequency distributions. Some growth curves 
assume shapes very similar to frequency curves.² Figure 54 
is an illustration of a growth curve, showing the increase in 
Chlorella vulgaris cultures over a period of hours. The two 
curves contrast the peak of growth for two different-sized inoculums; 
in both cases the rate of multiplication per cell varied 
inversely with the density of population, not only in the early 
stages of growth but throughout the growth period in each 
culture. 
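
Why a curve of growth increments can resemble a frequency curve may be seen from a rough sketch. The logistic rule used here is an assumption chosen only because it makes the rate of multiplication per cell fall as the population becomes denser; it is not the model fitted in the study cited, and the starting sizes, capacity, and rate are hypothetical.

```python
def growth_increments(start, capacity, rate, steps):
    """Increase in population at each step under a simple logistic rule.

    The per-cell rate of multiplication, rate * (1 - n / capacity),
    declines as the population n becomes denser.
    """
    sizes = [start]
    for _ in range(steps):
        n = sizes[-1]
        per_cell_rate = rate * (1 - n / capacity)
        sizes.append(n + per_cell_rate * n)
    return [later - earlier for earlier, later in zip(sizes, sizes[1:])]


# Two hypothetical inoculum sizes grown under the same conditions.
small = growth_increments(start=1.0, capacity=1000.0, rate=0.5, steps=40)
large = growth_increments(start=50.0, capacity=1000.0, rate=0.5, steps=40)

peak_small = small.index(max(small))  # step at which growth is fastest
peak_large = large.index(max(large))
print(peak_small, peak_large)
```

In both runs the increments rise to a single interior peak and then decline, and the larger inoculum reaches its peak sooner. The resulting shape looks like a frequency polygon although nothing has been counted into class intervals.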


BIVARIATE SERIES 

Bivariates are cross classifications of two variable charac- 
teristics possessed in common by the objects being studied. 
Graphs of bivariates are sometimes confused with frequency 
distributions because in some cases their shape resembles the 
frequency curve. Charts of bivariates, however, may assume 
almost any shape, and the center of the distribution may have no 
more importance than any other part of it. Good examples of 
bivariate comparisons may be found among the great variety 
of vital events when they are related to the different ages by 
their frequency of occurrence. 

¹ Williams, C. B., “A Note on the Statistical Analysis of Sentence-length 
as a Criterion of Literary Style,” Biometrika, Vol. 31 (1940), Parts III, IV, 
pp. 356-361. 

² For other types of growth curves, see Chap. XX. 

Table 9 and Fig. 55 present a set of such distributions. These 
are clearly bivariate comparisons. The x scale in these charts 
is the variation in age, from childhood to old age, representing a 
heterogeneous scale with respect to many vital events, such as 
susceptibility to certain types of disease, accident, etc. Difference 
in age constitutes in itself an attribute introducing lack of 
homogeneity where such a reference is made of it. With reference 
to many types of diseases, man at very tender ages and at 
old ages is a different being from man at middle life or in the 
prime of youth. Such bivariates have no reference to central 
tendencies — the matter of central tendencies is irrelevant. 
What is sought is a picture of the association between the two 
variables, and the very character of the data is such that there 
can be no expectation of a piling up of frequencies about one 
average or central tendency. Figure 55 is presented to show a 
number of examples of bivariate charts. It is readily seen that 
when the purpose is understood such charts are very useful as 
a method of picturing vital statistics; but merely because the 
shape of the two last examples resembles the frequency polygon 
it does not follow that these are true frequency distributions. 

Fig. 54. — Growth curve showing the rate of increase in population in Chlorella 
vulgaris cultures as a function of time. [Pratt, Robertson, “Influence of the 
Size of the Inoculum on the Growth of Chlorella vulgaris in Freshly Prepared Culture 
Medium,” American Journal of Botany, Vol. 27 (January, 1940), p. 54.] 

[Fig. 55. Axis label: Death rate per 100,000.] 


Table 9. — Death Rates per 100,000 Population in the United States, 
1929, from Specified Causes, by Age* 

Age       Tuberculosis    Scarlet         Cerebral        Broncho-        Puerperal 
group     of the lungs,   fever,          hemorrhage,     pneumonia,      septicemia, 
          male whites     male whites     male whites     male whites     female whites 

 0-            7.01          11.29            2.06          182.00 
 5-            2.27           5.81            0.59            6.44 
10-            3.13           1.74            0.47            2.85            0.15 
15-           22.37           1.11            0.83            3.42            9.94 
20-           56.33           0.94            1.50            4.33           23.01 
25-           72.28           0.92            2.19            4.66           24.72 
30-           80.34           0.79            4.24            6.70           22.25 
35-           86.17           0.60            9.95            9.38           18.48 
40-           95.66           0.19†          21.47           13.91            8.18 
45-          101.08           0.18           45.22           16.47            0.99 
50-          100.32           0.28           85.37           22.38 
55-          105.27           0.09          170.99           29.77 
60-          102.63           0.17          286.15           46.03 
65-          114.62           0.23          506.14           77.05 
70-          106.77                         814.09          124.62 
75-          110.39                       1,323.92          238.82 
80-           76.09‡                      2,015.65          445.22 
85-                                       2,477.50          845.00 
90-                                       2,365.00        1,035.00 
95- and 100-                              4,320.00        1,920.00 

* The rates in this table were calculated from data on total deaths by ages in the total 
registration area of the United States in 1929 according to the Bureau of the Census (1932), 
Thirteenth Annual Report on Mortality Statistics, 1929, pp. 196-197, 198-199; 202- 
203, 206-207, 210-211; and population of the United States by age groups as reported 
in the Abstract of the Census (1930), p. 183. In 1929, the death registration area of 
continental United States included 95.7 per cent of the total population. 
† Rates in italics based on less than 10 deaths. 
‡ Single rate for ages 80 and over, which are bracketed together in this column. 


The odd shapes that may be assumed by bivariate charts 
are shown by these illustrations. They may be U shaped; they 
may be J shaped; they may be S shaped; and, of course, they 
may be shaped like an ordinary frequency distribution, but when 
they are this is a matter of coincidence, without significance.¹ 
Figure 56 is an illustration of a bivariate chart of data in the 
field of the natural sciences, which is shaped like a frequency 
curve and which even uses the word frequency in the title of 
one of its units, though it is not a frequency curve. It is a chart 
of a bivariate comparison — the amplitude in centimeters com- 
pared with frequency in cycles per second. 
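
The variety of shapes can be checked against Table 9 itself. The sketch below, in Python, copies three columns of rates from the table and merely locates the lowest and highest rate in each; the function names are illustrative, and the last age label stands for the combined groups 95 and over.

```python
# Lower limits of the age groups; 95 stands for the combined groups 95 and over.
AGES = list(range(0, 100, 5))

# Rates per 100,000 as read from Table 9.
cerebral_hemorrhage = [
    2.06, 0.59, 0.47, 0.83, 1.50, 2.19, 4.24, 9.95, 21.47, 45.22,
    85.37, 170.99, 286.15, 506.14, 814.09, 1323.92, 2015.65, 2477.50,
    2365.00, 4320.00,
]
bronchopneumonia = [
    182.00, 6.44, 2.85, 3.42, 4.33, 4.66, 6.70, 9.38, 13.91, 16.47,
    22.38, 29.77, 46.03, 77.05, 124.62, 238.82, 445.22, 845.00,
    1035.00, 1920.00,
]
PUERPERAL_AGES = list(range(10, 50, 5))  # rates are shown only for ages 10 to 45
puerperal_septicemia = [0.15, 9.94, 23.01, 24.72, 22.25, 18.48, 8.18, 0.99]


def extremes(ages, rates):
    """Return (age group with the lowest rate, age group with the highest rate)."""
    lowest = ages[rates.index(min(rates))]
    highest = ages[rates.index(max(rates))]
    return lowest, highest


def peak_is_interior(ages, rates):
    """True when the highest rate falls strictly between the first and last groups."""
    _, highest = extremes(ages, rates)
    return ages[0] < highest < ages[-1]


print(extremes(AGES, cerebral_hemorrhage))
print(extremes(AGES, bronchopneumonia))
print(extremes(PUERPERAL_AGES, puerperal_septicemia))
```

Only the puerperal septicemia column has its highest rate in the interior of the age scale, which is what gives its chart the outward look of a frequency polygon. The other two columns run to their highest values at the oldest ages, and bronchopneumonia is also very high in the first age group.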



[Axis label: Frequency, cycles per sec.] 

Fig. 56. — Another bivariate comparison. [Clark, A. L., and L. Katz, “Resonance 
Method for Measuring the Ratio of the Specific Heats of a Gas, Cp/Cv,” Canadian 
Journal of Research, Vol. 18 (February, 1940), p. 30.] 

Figure 57 shows the relationship between inventories and 
shipments of all manufacturing industries in the United States 
and is a bivariate chart. The dotted line on the figure represents 
the average relationship of inventories to shipments based on 
the 2½-year period from 1939 through the second quarter of 1941. 
Deviations from this relationship by the quarterly items were 
small during the base period, the expansion of inventories being 
generally in proportion to the expansion of shipments. In 
contrast, inventories increased phenomenally in relation to 
shipments during the latter half of 1941 and the first half of 
1942. Protective buying replaced immediate production needs 
as a motive for much of the inventory accumulation during this 
[...]cated requirements of production, assuming that the shipments 
give an indication of requirements for production. 

[Axis label: Shipments, total for quarter (billions of dollars).] 

Fig. 57. — A third example of a bivariate comparison. [Source: Survey of Current 
Business, Vol. 23 (1943), pp. 3-9.] 

¹ Cf. also such a type of frequency distribution as that described by 
Thomas V. Pearce, “An Unusual Frequency Distribution — the Term of 
Abortion,” Biometrika, Vol. 22 (1930-1931), pp. 250-252. 

STATIC VARIATION AND DYNAMIC VARIATION 

In statistical analysis there are two general forms of variation. 
The static form of variation is that occurring at a given point 
in time or occurring in such a manner that time may be rationally 
regarded as irrelevant to the variation. Where the variations 
that occur are a function of time, however, the variation is 
dynamic and requires different methods of analysis. In the 
main, the methods of analysis of static variation center in the 
treatment of the frequency distribution, whereas the methods 
of analysis of dynamic variation call for a different application 
of principles. The same fundamentals, however, are used in 
the analysis of dynamic variation or time series as those used 
for the analysis of frequency distributions; only the applications 
differ. 

Rational Frequency Distributions. A rational frequency dis- 
tribution is one in which that arrangement of the data is suggested 
by the nature of the matter observed. Such a frequency dis- 
tribution is rational also because the variability of a common 
characteristic is chosen as the basis of the particular classification 
and this basis remains comparable among the objects measured. 
Frequently, the same idea is expressed by saying that the data 
are homogeneous; thus a rational frequency distribution means 
one in which the variable is homogeneous. 

Homogeneity may be defined as the condition prerequisite to 
comparability of data with respect to the attribute or factor 
being considered. The negative aspect of this condition is that 
attributes not being considered are judged unimportant for the 
purposes of the study in hand. The positive aspect of this 
condition is that attributes or factors judged important for the 
purpose of the study are taken into consideration. 

For example, if the attribute height of human beings is being 
considered, color of eyes may be judged irrelevant and therefore 
is not considered. But, for a homogeneous study, age, sex, and 
perhaps race are attributes that must be considered because they 
are all correlated with the attribute height and cannot therefore 
be judged unimportant in studying height. Unimportant 
attributes (those ignored) have zero correlation with the attri- 
bute studied. Attributes correlated with the attribute studied 
must be taken into consideration in order to obtain homogeneous 
data. In the example of heights, homogeneity is obtained by 
classification, that is, by taking heights of a particular class in 
which the correlated attributes are constants. Thus, heights of 
mature Caucasian males may be taken as one homogeneous 
group; another homogeneous group would be heights of mature 
Caucasian females; another would be sixteen-year-old Caucasian 
males; etc. 

An important result of homogeneity is that no particular cause 
of bias or cumulated variation is present. On the contrary, the 
causes of variation consist of many minute mutually uncorrelated 
(or independent) causes of variation that occur according to the 
law of large numbers, in other words, in a random manner.¹ 

Irrational Frequency Distributions. By disregarding the element 
of time present in a time series, whose natural arrangement 
is according to time occurrence, the data may be reclassified and 
arranged in an array, or a frequency distribution. Such a 
rearrangement would conceal the natural time sequence originally 
present in time-series data when in their natural or rational order. 
This type of frequency distribution is irrational as a method of 
summarization. The multiple forces affecting variability in a 
time series are not usually operative at random or in a mutually 
independent manner. On the contrary, the causes of variation 
may and usually do form a cumulative series of mutually depend- 
ent variations. It is to be noted in passing that Figs. 46 to 48 
are not distributions of time series. In the data for each of these 
frequency distributions the attribute summarized is for a specified 
time and all the variables are for that specified time — thus time 
is held constant, and the variation shown in the histogram is 
uncorrelated with time. These are rational frequency dis- 
tributions. But as soon as the data are viewed in their dynamic 
aspect, that is to say, are correlated with time, the many biasing 
attributes or factors of time destroy homogeneity in the data.² 
For example, with respect to the price of sugar taken as a time 
series, the supply at a subsequent period might tend to be larger 
as a result of the relatively high price existing at the earlier 
period; and as a consequence the previous high price is a cause 
of a later lower price. The existence of a price situation, moreover, 
at a given time may produce technological changes in the pro- 
duction and distribution of sugar that in turn will be a dominant 
factor in the determination of a subsequent price. 
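
What is lost by such a reclassification can be shown with a small sketch in Python. The two price series are hypothetical figures invented for the illustration: they contain the same twelve values in different time orders, so their frequency distributions are identical although their behavior through time is not.

```python
from collections import Counter


def frequency_distribution(values, width):
    """Count the values falling in each class interval [k*width, (k+1)*width)."""
    return Counter(int(v // width) * width for v in values)


def successive_changes(series):
    """Change from each period to the next, in time order."""
    return [round(later - earlier, 1) for earlier, later in zip(series, series[1:])]


# Hypothetical prices for twelve successive periods (illustrative figures only).
rising = [4.1, 4.3, 4.6, 4.8, 5.2, 5.5, 5.9, 6.1, 6.4, 6.8, 7.0, 7.3]
# The same twelve prices occurring in a different time order.
erratic = [6.1, 4.3, 7.3, 5.2, 4.1, 6.8, 5.5, 4.8, 7.0, 4.6, 6.4, 5.9]

same_distribution = (
    frequency_distribution(rising, 1.0) == frequency_distribution(erratic, 1.0)
)
print(same_distribution)            # the two frequency tables cannot be told apart
print(successive_changes(rising))   # every change is an increase
print(successive_changes(erratic))  # increases and decreases alternate
```

The frequency table summarizes how often each price level occurred and nothing more; the cumulative, mutually dependent movement visible in the first series has been discarded.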

In spite of the fact that the procedure of reclassifying time 
series and arranging the data in frequency distributions is 
irrational, it has legitimate uses in statistical analysis. There 
is a place for irrational procedure in the progressive development 
of knowledge; but, when used, the user should be conscious of the 
irrationality involved.³ 

¹ For careful consideration of the law of large numbers, see Chaps. IX-XI; 
also see J. G. Smith and A. J. Duncan, Sampling Statistics and Applications, 
hereafter referred to by the short title Sampling Statistics. 

² For further discussion of time-series analysis see Chaps. XIX-XXV. 

³ Fischer, Ludwig, The Structure of Thought, p. 360. Cf. for illustrations 
of such rearrangement of time series, Dickson H. Leavens, “Frequency 
Distributions Corresponding to Time Series,” Journal of the American 
Statistical Association, Vol. 26 (1931), pp. 407-415. 

VARIABLE QUALITIES, OR ATTRIBUTES 

When statisticians have to deal with variable qualities, such as 
different colors, different races, different climatic conditions, 
different geographical locations, or different intellectual or moral 
capacities, their problems are principally questions of the consistent 
use of class or group distinctions. Usually there is no 
need for elaborate quantitative treatment. Yet, so far as 
possible, statisticians strive to convert quality, or attribute, 
differences into quantitative terms, and when that is accomplished, 
their analysis is similar to the analysis of frequency 
series. It has been found, for example, that certain tests can 
be made to provide quantitative measures of differences in 
intelligence, native or acquired; and a large scope for statistical 
analysis lies in the field of education and psychology through 
the use of these tests. 



PART II 


Analysis of Frequency Distributions 

CHAPTER VI 

SUMMARIZATION AND COMPARISON 

For summarization and comparison of static variation the 
fundamental tool of analysis is the frequency distribution, its 
graphic presentation, and the analysis of its characteristics. The 
frequency distribution portrayed in a table or in a graph gives 
a picture of the whole of the variation relative to some 
particular matter; but how can comparisons be made? The 
frequency table, especially if large numbers of magnitudes are 
involved, even though it is admittedly better than a haphazard 
arrangement of the data, requires study before the mind can 
grasp its full significance. If two frequency distributions of 
heights (e.g., of mature males in New Jersey in 1800 and of 
mature males in New Jersey in 1900) are to be compared, the 
frequency table could be used, but the total number of cases 
measured might be different in each year taken, which would 
make it more difficult to discern the similarity or lack of similarity 
of the two distributions. To make comparisons, a chart could 
be drawn; but a chart may be large or small depending on the 
scale used, and differences would then appear from purely 
arbitrary, mechanical causes having no real significance. More- 
over, if the heights of these same males and also their weights 
are to be compared, a comparison of nonhomogeneous units 
(inches of height and pounds of weight) is required. Clearly 
some method or methods of summarization and comparison of 
frequency distributions must be devised. 

Use of Frequency Distributions. The common practice of 
attaining a summary figure by “averaging” is familiar to all, but 
it should be clear that an average, taken by itself, is indeed a 
very “summary” expression for a variable. It is one value, used 
to represent a whole series of variations; and a study of the variations 
about the average may be as important as or more important 
than the study of the average alone. In statistics and in most of 
the fields of study that use statistics and statistical methods, the 
average is generally a convenient point of departure for a more 
adequate analysis of the variable.¹ 

Types of Comparison. There are six possible ways in which it 
may be desirable to obtain summary figures and to make comparisons. 
This may be explained by the use of diagrams, as follows: 

In Fig. 58, the central tendency, or average, is located at A, 
which is plumb with the peak of the frequency curve. In this 
figure the central tendency is typical, in the sense that it is a 
magnitude that occurs more frequently than other magnitudes. 
It may be looked upon quite rationally as a norm, or typical 
value. In such a case, the average value has a significance 
for itself, as a summary value, but its principal use is still a 
comparative one. For example, suppose that in Fig. 58 the 
quantity variations (the X scale) are heights of children of a 
specified age while the curve represents the number of children 
having the indicated heights. The question is asked whether or 
not a certain child is normal in height. If the child has less 
height than height A, how much less must he be so that lack of 
development in this respect indicates need for medical advice? 
At once it is suggested that it is important to determine how 
much on the average children vary from this normal height. 
Accordingly, the principal use of the average as a summary 
figure, when used as a norm, is to compare individual variations 
with the average and to compare individual variations with the 
average amount of variation to be expected. 
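
A minimal sketch of this first type of comparison, in Python. The heights are hypothetical, and the average deviation is used as the measure of "average amount of variation" only because it is the simplest to compute; other measures of variability would serve the same purpose.

```python
def mean(values):
    """Arithmetic average."""
    return sum(values) / len(values)


def average_deviation(values):
    """Average of the absolute deviations from the arithmetic average."""
    center = mean(values)
    return sum(abs(v - center) for v in values) / len(values)


# Hypothetical heights (inches) of children of one specified age.
heights = [48, 49, 50, 50, 51, 51, 51, 52, 52, 53, 54]

norm = mean(heights)                           # the central tendency, A
typical_variation = average_deviation(heights)

child = 47                                     # height of the child in question
shortfall = norm - child                       # the child's individual variation
ratio = shortfall / typical_variation          # variation relative to what is usual
print(norm, round(typical_variation, 2), round(ratio, 2))
```

Here the child falls short of the norm by more than three times the average deviation. Whether that is enough to call for medical advice is a matter of judgment, but the ratio states the question in comparable terms.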

The second type of comparison is the difference in central tendencies 
existing between two distributions a and b, as illustrated 
in Fig. 59. This difference is measured by comparing the averages 
of the two distributions; for example, by the comparison of 
the average height of children in third grade with the average 
height of children in sixth grade. Such a comparison is rational 
only where the units of the two frequency distributions are 
comparable. 

¹ Fisher, R. A., Statistical Methods for Research Workers (1941), Section 1. 
References to section numbers are valid for any edition of this book. 

Fig. 58. — A central tendency as a norm. 

Fig. 59. — Two different central tendencies. 

Fig. 60. — Similar central tendencies but different variability about 
the central tendencies. 

The third type of comparison is illustrated in Fig. 60, in 
which some sort of measurement of the variability of the variations 
about the average is required for making the comparison; 
for example, an average of the variations from the central 
tendency could be used. Such measures are called “measures of 
variability.” 



Fig. 61. — Similar central tendencies, but different types of skewness of distribution 
about central tendencies. 

Figure 61 illustrates the fourth type of comparison. Frequency 
curves a and b have peaks plumb with the same quantity, 
A, but a is skewed to the left and b is skewed to the right. The 
central tendency A is a value of greatest frequency in both 
curves, but the lower range of curve a is farther from A than is 
the lower range of b; and the upper range of a is much nearer 
to A than is the upper range of b. School grades sometimes 
have a frequency distribution like a, with the most common grade 
around 70 or 80 and with very few above 90, yet with some grades 
below 20 or even as low as 10. Personal incomes are distributed 
like curve b, with the most common income at an amount near 
to the lowest and a few incomes of amounts far above the most 
common amount. When frequency curves like a or b in Fig. 61 
are encountered, it is desirable to have some way to measure 
skewness and evaluate its importance in connection with 
the interpretation of other statistics about the frequency curve. 

Fig. 62. — Different central tendencies and different variabilities. 

In the fifth type of comparison, illustrated by Fig. 62, not only 
may it be desirable to compare average with average and variability 
with variability in aggregate terms, but it may be essential also 
to find a way to compare relative variability. The variability in 
b relative to its average may not be so much larger than the 
relative variability in a as the graph seems to show. The graph 
shows that the absolute variability in b is greater; but it may 
be that the relative comparison is the more significant one. To 
make the relative comparison requires the calculation of further 
information. 
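
One way to form such a relative comparison is to divide a measure of variability by the average to which it refers. The Python sketch below does this with the average deviation; both the choice of measure and the two sets of figures are assumptions made for the illustration.

```python
def mean(values):
    """Arithmetic average."""
    return sum(values) / len(values)


def average_deviation(values):
    """Average of the absolute deviations from the arithmetic average."""
    center = mean(values)
    return sum(abs(v - center) for v in values) / len(values)


def relative_variability(values):
    """Average deviation expressed as a fraction of the average."""
    return average_deviation(values) / mean(values)


# Hypothetical distributions: b has the larger average and the wider spread.
a = [18, 19, 20, 21, 22]
b = [90, 95, 100, 105, 110]

print(average_deviation(a), average_deviation(b))                       # absolute variability
print(round(relative_variability(a), 3), round(relative_variability(b), 3))  # relative variability
```

In absolute terms b varies five times as much as a, yet relative to their own averages the two vary to the same degree. A graph drawn on an arithmetic scale would show only the first of these facts.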

Curves a and b in Fig. 63, which illustrates the sixth type of 
comparison, have the same central tendencies and approximately the 
same average deviations about their central tendencies; but b has 
a relatively greater concentration of small variations close to the 
central tendency and also relatively more extreme variations than 
does a. Another way of looking at this difference is to note that 
the shoulders of a are broader than the shoulders of b and that the 
top of a is flatter than the top of b. The relative flatness of top 
or breadth of shoulder of a distribution is called “kurtosis.” The 
measurement of this characteristic is important in determining the 
relative importance of small variations from average in the two curves. 

Fig. 63. — Similar central tendencies, similar variabilities, and 
absence of skewness, but different concentrations at center and along 
tails. 
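
The contrast between curves a and b can be imitated numerically. In this Python sketch the two sets of values are hypothetical, constructed to have the same average and the same average deviation; the measure of kurtosis used, the fourth central moment divided by the square of the second, is a conventional one assumed here for illustration.

```python
def mean(values):
    """Arithmetic average."""
    return sum(values) / len(values)


def average_deviation(values):
    """Average of the absolute deviations from the arithmetic average."""
    center = mean(values)
    return sum(abs(v - center) for v in values) / len(values)


def kurtosis(values):
    """Fourth central moment divided by the squared second central moment."""
    center = mean(values)
    n = len(values)
    m2 = sum((v - center) ** 2 for v in values) / n
    m4 = sum((v - center) ** 4 for v in values) / n
    return m4 / m2 ** 2


# a: deviations spread evenly (flat top, broad shoulders).
a = [-2, -1, 0, 1, 2]
# b: most cases at the center, with a few extreme deviations.
b = [-6, 0, 0, 0, 0, 0, 0, 0, 0, 6]

print(average_deviation(a), average_deviation(b))   # the same for both
print(round(kurtosis(a), 2), round(kurtosis(b), 2))  # b is much the larger
```

Neither the average nor the average deviation distinguishes the two sets; only a measure sensitive to the concentration at the center and along the tails does so.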

It appears to follow from the above discussion of six types 
of comparison that the analysis of frequency distributions requires 
the calculation of the average and, in addition, the calculation 
of measures of dispersion.¹ 

THEORETICAL SIGNIFICANCE OF FREQUENCY CURVES 

Histograms and Frequency Curves. It has been noted that a 
frequency distribution may be graphed in the form of a histogram, 
that is, a figure in which the frequency of any class interval is 
represented by a rectangle erected on that interval as a base and 
with a height equal to the observed frequency.² If the data are 
continuous in character, that is, if they change by very small 
jumps, it may become reasonable to represent the frequency 
distribution by a smooth frequency curve rather than by a broken 
histogram. 

Area Histograms. It is possible to make certain modifications 
in the form of the ordinary histogram to represent the frequency 
of cases occurring in any class interval, not by the height of the 
rectangle, but by the area of the rectangle. If the class interval 
is equal to unity, an area histogram is identical with one in which 
frequencies are represented by heights, since the altitude multiplied 
by the base equals the area. But if the class interval is 
greater than unity, the height of an area histogram will be 
proportionately reduced; if the class interval is less than unity, the 
height will be proportionately increased. This follows because in 
an area histogram the frequency of any class interval is given by 
the height of the rectangle erected on it, multiplied by the length 
of its base (that is, by the size of the class interval). In 
histograms of the area type, it follows that the total area of the 
histogram always equals the total number of cases, N. 
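
The rule just stated, height equals frequency divided by the size of the class interval, can be written out in a few lines of Python. The frequencies and class intervals are hypothetical and the function name is illustrative.

```python
def area_histogram_heights(frequencies, widths):
    """Height of each rectangle so that height * width equals the class frequency."""
    return [f / w for f, w in zip(frequencies, widths)]


frequencies = [4, 10, 6]   # hypothetical class frequencies, N = 20
widths = [2.0, 1.0, 0.5]   # class intervals greater than, equal to, and less than unity

heights = area_histogram_heights(frequencies, widths)
total_area = sum(h * w for h, w in zip(heights, widths))

print(heights)     # [2.0, 10.0, 12.0]
print(total_area)  # 20.0, the total number of cases
```

The first height is reduced below its frequency because the interval exceeds unity, the second is unchanged, and the third is increased because the interval is less than unity; the total area nevertheless equals N.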

Relative Frequencies. The histogram may be further modified 
by making it represent relative or proportional frequencies, 
rather than absolute frequencies. Following is a table showing 
a proportional frequency distribution:³ 

Maternal Mortality in Cities of 100,000 or More Population in the 
United States, 1938 

Death rates                          Relative number of cities 
(number per 1,000    Number of     Expressed as     Expressed as 
live births)         cities        proportions      percentages 
X                    F             F/N              100 F/N 
(1)                  (2)           (3)              (4) 

 1-                    2           0.022              2.2 
 2-                   16           0.172             17.2 
 3-                   18           0.193             19.3 
 4-                   20           0.215             21.5 
 5-                   15           0.161             16.1 
 6-                   10           0.108             10.8 
 7-                    4           0.043              4.3 
 8-                    6           0.064              6.4 
 9-                    0           0.000              0.0 
10-                    2           0.022              2.2 

Total                 93           1.000            100.0 

¹ See pp. 168-195. 

² For illustrations, see Figs. 64 and 66, pp. 187-188. 

³ Cf. p. 141. 


In the above table the figures in column (3) represent the pro- 
portionate frequencies, namely, the proportionate number of 
cities having the specified maternal mortality rates. Since this 
illustration has a class interval of 1, an area histogram could be 
obtained by plotting the frequencies of column (2) in the form of 
vertical bars, with the heights equal to the respective frequencies. 
A proportional area histogram could be obtained by similarly 
plotting the frequencies shown in column (3); because in the 
resulting histogram, the area of each rectangle would represent 
the proportion of the total number of cases falling in a class 
interval; it would represent F/N instead of F. The total area 
of such a histogram will always equal unity, just as the total of 
column (3) equals unity. This will be true no matter what the 
form or shape of the histogram, because ΣF = N. 
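
The computation of column (3) from column (2) may be sketched in Python as follows. The frequencies are those of the maternal mortality table; the variable names are illustrative.

```python
# Column (1): lower limits of the death-rate classes; column (2): number of cities.
rates = list(range(1, 11))
cities = [2, 16, 18, 20, 15, 10, 4, 6, 0, 2]

n = sum(cities)                           # N, the total number of cases
proportions = [f / n for f in cities]     # column (3): F / N
percentages = [100 * p for p in proportions]  # column (4)

total_area = sum(proportions)             # class interval is 1, so area = height

for rate, f, p in zip(rates, cities, proportions):
    print(f"{rate:>2}-  {f:>3}  {p:.3f}")
print(n, round(total_area, 3))
```

Because every proportion is F divided by the same N, the proportions must total unity whatever the shape of the distribution. Proportions rounded independently to three places may differ from a printed column by a unit in the last place.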

Frequency Curves. Suppose that the data, from which the 
histogram has been constructed, are a sample from a very large 
set of cases, theoretically an infinite set. For instance, the 
data might be the heights of 100 adult males of the white race, 
instead of the mortality statistics above illustrated. The 100 
heights, then, would be a sample of the heights of all adult men 
of that race, presumably millions of men. In such a relatively 
small sample, the size of the class interval cannot be reduced without 
causing the histogram to show very irregular fluctuations. 
If, however, many cases are added to the number in the sample, 
say heights of 200 men, the size of the class interval could be 
reduced, for example, from 10 units to 5 units, without causing 
the occurrence of such irregularities. In fact, if the number in the 
sample is made larger and larger and at the same time the size of 
the class interval is continuously reduced, the histogram will 
tend to become more and more regular and the tops of the rec- 
tangles, which are getting narrower and narrower, will come 
closer and closer to forming a smooth continuous curve (a fre- 
quency curve). In such a manner the frequency curve may be 
viewed as the limit that an area histogram of relative frequencies 
approaches as the number of cases is increased and the size of the 
class interval is reduced indefinitely. The frequency curve is the 
distribution of a theoretically infinite set of data, with a 
theoretically infinitesimal class interval. 
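
The limiting process can be imitated by simulation. In the Python sketch below the "population" of heights is assumed to follow a normal curve with an average of 68 inches and a standard deviation of 2.5 inches; these figures, the sample sizes, and the class intervals are all hypothetical and serve only to show the tendency described.

```python
import math
import random


def relative_area_histogram(values, width):
    """Map each class lower limit to a height such that height * width = F / N."""
    counts = {}
    for v in values:
        lower = math.floor(v / width) * width
        counts[lower] = counts.get(lower, 0) + 1
    n = len(values)
    return {lower: c / (n * width) for lower, c in counts.items()}


def normal_ordinate(x, mean, sd):
    """Height of the normal frequency curve at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def area_and_largest_gap(n, width, mean=68.0, sd=2.5):
    """Total histogram area, and the largest gap between rectangle tops and the curve."""
    sample = [random.gauss(mean, sd) for _ in range(n)]
    hist = relative_area_histogram(sample, width)
    area = sum(h * width for h in hist.values())
    gap = max(
        abs(h - normal_ordinate(lower + width / 2, mean, sd))
        for lower, h in hist.items()
    )
    return area, gap


random.seed(7)  # fixed seed so the illustration is repeatable
area_small, gap_small = area_and_largest_gap(n=100, width=2.0)
area_large, gap_large = area_and_largest_gap(n=200_000, width=0.25)
print(round(area_small, 6), round(gap_small, 4))
print(round(area_large, 6), round(gap_large, 4))
```

The total area is unity in both cases, as it must be for an area histogram of relative frequencies. With the large sample and the narrow class interval, the tops of the rectangles lie very close to the smooth curve.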

Being the limit approached by an area histogram of relative 
frequencies, the frequency curve has a total area (between the 
curve and the x-axis) that is always equal to unity. Furthermore, 
any section of area under the curve¹ will give the relative 
frequency of the cases falling within the class interval marking 
off that section of area. It is upon this basis that tables of 
relative frequencies are constructed for certain well-known 
frequency curves.² 
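
As an illustration of reading relative frequencies from areas, the following Python sketch takes the normal curve as the example of a well-known frequency curve (an assumption; the text has not yet singled it out here) and obtains areas from the error function in the standard library rather than from a printed table.

```python
import math


def normal_area(lower, upper, mean=0.0, sd=1.0):
    """Relative frequency of cases between lower and upper under a normal curve."""

    def cumulative(x):
        # Area under the curve from minus infinity up to x.
        return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

    return cumulative(upper) - cumulative(lower)


within_one_sd = normal_area(-1.0, 1.0)            # about 0.6827
whole_curve = normal_area(-math.inf, math.inf)    # total area under the curve
print(round(within_one_sd, 4), whole_curve)
```

The first figure states that about 68 per cent of the cases fall within one standard deviation of the average; the second confirms that the total area is unity.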

Uses of Frequency Curves. Frequency curves are hypothetical, 
but they are idealizations of frequency distributions that are real. 
They serve many useful purposes and in the theory of statistics 
they are indispensable. One important use of frequency curves 
is the graduation of frequency distributions obtained from actual 
observation. Suppose, for example, that a frequency distribu- 
tion has been constructed, using a class interval of 10 units. 
Suppose further that the number of cases is such that any smaller 
class interval would introduce marked irregularities into the 
distribution, irregularities that, it is believed, would not be present 
if an infinitely large number of cases were observed. In this 
case a frequency curve fitted to the distribution (histogram) 
may be the best means of estimating the true frequency for any 
given class interval. In other words, the frequency curve affords 
a graduation for the frequency distribution. The frequency 
curve makes it possible to interpolate values not given directly 
by the original sample frequency distribution. 

¹ See Fig. 91, p. 264 and Fig. 94, p. 277. 

² See Appendix, Table VI. 

Besides serving to graduate a given set of data, frequency 
curves facilitate in other ways the description and comparison of 
frequency distributions. For instance, the peakedness or flatness 
of a particular frequency curve, called the “normal frequency 
curve,” is taken as the standard to which the peakedness 
or flatness of a given distribution is generally referred. Again, 
theoretical analysis shows that data affected by certain kinds of 
forces will tend to be distributed in the form of particular types 
of frequency curves. Certain types of curves, therefore, become 
the expected norm for all data affected by particular kinds of 
forces. As a consequence, the hypothesis that variations in a 
given set of data have resulted from certain forces may be tested 
by noting how well the distribution of the data conforms to the 
type of frequency curve that these forces may be expected to 
produce. In such instances frequency curves help to explain the 
underlying causes of variation. Such an analysis is of special 
importance when it is assumed, as it is in so many statistical 
procedures, that chance is the fundamental cause of variation. 
It is to be noted that a difference in the general form of two 
frequency distributions may in some cases be looked upon as of 
more fundamental importance than a mere difference in their 
averages, dispersion, and the like, because such a difference in 
form may indicate a contrast in the type of forces causing varia- 
tion in the data. To detect a fundamental difference of this 
kind frequency curves are used. 

Still another useful purpose served by frequency curves is in 
sampling analysis. Since a chapter is subsequently devoted to a 
discussion of sampling, it need merely be touched upon and 
simply illustrated in very general terms at this point.¹ 

¹ See Chap. XII; see also Smith, J. G., and A. J. Duncan, Sampling 
Statistics. 

For illustration, suppose that a large number of balls, each with a 
number written on it, are placed in a big bowl and thoroughly 
mixed. Suppose that 10 balls are drawn at random from the 
bowl and their numbers read off and averaged. Suppose that 
this sampling operation is repeated over and over again, the 
balls being replaced and thoroughly mixed after each set of 
drawings. Experience shows that unless the distribution of 
numbers is freakish, the distribution of sample averages will 
approximate the so-called “normal” frequency curve. If, instead of 
the average of the respective 10 readings, a certain measure of 
the variation around their averages, known as the variance, 
had been recorded in each instance, then the frequency 
distribution of these measurements of variation would have tended to 
conform to a frequency curve known as the “χ² curve.”² The 
significant thing is that “sampling distributions” of this kind tend 
to conform to specific frequency curves that may be described 
by definite mathematical formulas. In general, these formulas 
are expressed in terms of the characteristics of the “populations” 
(in the illustration, the bowl of numbers) from which the 
samples are drawn. The consequence is that if a random sample 
has been obtained from an unknown population, it is possible 
from knowledge of the sampling distributions of various sample 
measurements to make certain inferences regarding the nature 
of the population from which the sample has been drawn. This 
is probably the most important use that is made of frequency 
curves in statistical analysis. 
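
The bowl experiment can be imitated numerically. What follows is only a minimal sketch under assumptions that are not in the text: the bowl is taken to hold the numbers 1 to 100 once each, 2,000 samples of 10 are drawn, and the names `simulate_bowl`, `bowl`, `means`, and `variances` are illustrative.

```python
import random
import statistics

def simulate_bowl(bowl, sample_size=10, repetitions=2000, seed=0):
    """Draw repeated random samples from the bowl; return sample means and variances."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    means, variances = [], []
    for _ in range(repetitions):
        # One set of drawings; the whole bowl is available again for the next set.
        sample = rng.sample(bowl, sample_size)
        means.append(statistics.fmean(sample))
        variances.append(statistics.pvariance(sample))
    return means, variances

bowl = list(range(1, 101))  # assumed population: the numbers 1 to 100
means, variances = simulate_bowl(bowl)

# The average of the sample means should lie close to the population mean.
print(round(statistics.fmean(bowl), 2), round(statistics.fmean(means), 2))
```

Tabulating `means` into a frequency distribution should give a roughly bell-shaped outline, while `variances` should give a skewed one, which is the contrast the paragraph describes.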

MEASUREMENTS OF SUMMARIZATION AND COMPARISON 

Population, Parameters, and Statistics. Population. To say 
that the population of the United States is one hundred and 
thirty million people is a familiar use of the word “population.” 
In statistics the word is used in the same familiar sense, but it is 
also used in a more general sense, referring to the count of 
persons or of animals of any kind or even to the count of inani- 
mate things. To statisticians the term means all the things, 
animate or inanimate as the case may be, of a given kind in 
the known universe or in a specified universe, for example, all the 
people on the earth, or all the people in the United States if the 
universe is more specific. An example of an inanimate 
population would be all the petroleum in the known universe or, if a 
more specific universe is considered, all the petroleum in the 
United States. 

² This is read “chi square”; the letter is the Greek small chi. 

Parameters and Statistics. In the theory of statistics the 
measurements of the characteristics of the population are called 
“parameters.” The average height of all people living in the 
United States is a parameter of the population. No one has 
ever actually measured the heights of all the people living in the 
United States, and it is not likely that anyone ever will do so. 
Nevertheless, this population does exist. In practice, it is 
much easier to estimate the average height of all the people by 
taking the average of a sample of the people. This latter 
average, the average of the sample, is called a “statistic.” 
Accordingly, parameters are measures of the characteristics of the 
population, and the corresponding sample measures are statistics 
commonly used to estimate these parameters. A statistic is thus 
a value computed from an observed sample in order to 
characterize the population from which it is drawn. Parameters are 
the characters of the population.¹ 

In accordance with this terminology, the quantities to be 
obtained as measures of central tendencies are “statistics,” the 
arithmetic mean is a “statistic,” the range (difference between 
the highest and lowest magnitude) of a frequency distribution is 
a “statistic.” 

Averages. There are several kinds of averages. The one 
most familiar is the arithmetic mean. The others most gen- 
erally presented are the median, mode, geometric mean, and 
harmonic mean. The most commonly used averages are the 
mean, the median, and the mode. In this chapter each will be 
viewed in its simplest aspect, and at the same time the symbolic 
language associated with the analysis of frequency distributions 
will be introduced. 

¹ Fisher, op. cit., pp. 7-^, 41. 

The Arithmetic Mean. By definition, the arithmetic mean 
is the sum of the cases divided by the number of cases. For 
example, taking a simple case of ungrouped data, i.e., where the 
frequencies are 1 throughout (each X occurs once), F₁, F₂, . . . , 
Fₙ each = 1: 

        X            F 
X₁      2    F₁      1 
X₂      3    F₂      1 
X₃      4    F₃      1 
X₄      6    F₄      1 
X₅      8    F₅      1 
X₆      9    F₆      1 
X₇     10    F₇      1 
     ΣX = 42      ΣF = 7 

The sum of the variable magnitudes in this case is 42. The 
number of variable magnitudes is 7. Hence, by definition, the 
arithmetic mean is 42/7, or 6. 

Symbolically, 

$$42 = \sum X, \text{ i.e., } X_1 + X_2 + \cdots + X_7$$
$$7 = \sum F = N, \text{ i.e., } F_1 + F_2 + \cdots + F_7$$

The arithmetic mean is represented by the symbol X̄, and 
hence 

$$\bar{X} = \frac{\sum FX}{N} \tag{1}$$


In frequency distributions the F is not equal to 1 throughout 
but varies. An illustration of the calculation of the arithmetic 
mean of a frequency distribution is shown below: 


 X      F      FX 
 2      3       6 
 3      3       9 
 4      6      24 
 5      9      45 
 6      6      36 
 7      3      21 
   ΣF or N = 30    ΣFX = 141 

It should be noted that the sum of the X's cannot be obtained 
by adding the first column because the various X's occur 3, 6, or 
9 times. Consequently, the sum of the X's is obtained by 
multiplying each X by its respective frequency and then adding 
the products. 

$$\sum FX = 141$$
$$\sum F \text{ or } N = 30$$

and therefore, by definition, 

$$\bar{X} = \frac{141}{30} = 4.7$$
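
The calculation of Eq. (1) for a frequency distribution can be sketched in a few lines of Python. The function name `frequency_mean` and the variable names are illustrative only; the data are those of the table above.

```python
def frequency_mean(values, freqs):
    """Return (sum of FX, N, arithmetic mean) for a frequency distribution."""
    n = sum(freqs)                                      # N = sum of F
    total = sum(f * x for x, f in zip(values, freqs))   # sum of FX
    return total, n, total / n

X = [2, 3, 4, 5, 6, 7]
F = [3, 3, 6, 9, 6, 3]

sum_fx, n, mean = frequency_mean(X, F)
print(sum_fx, n, mean)  # 141 30 4.7
```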

If the frequencies of a frequency distribution are expressed in 
relative numbers, i.e., if each frequency is expressed relative to 
the total number of cases, F₁/N, F₂/N, . . . , Fₙ/N, the 
arithmetic mean is merely the sum of the third column, as follows: 

 X      F/N     (F/N)X 
 2      0.1      0.2 
 3      0.1      0.3 
 4      0.2      0.8 
 5      0.3      1.5 
 6      0.2      1.2 
 7      0.1      0.7 
        1.0      4.7 

Following the definition, the arithmetic mean is the sum of the 
third column divided by the sum of the second column; but the 
sum of the second column is 1, by definition. Consequently, 
the arithmetic mean is the sum of the third column. 

$$\bar{X} = \sum \frac{F}{N} X \tag{1a}$$

This modified form of the definition of the arithmetic mean is very 
convenient in certain statistical problems. 

The sum of the deviations of the cases from the arithmetic 
mean is equal to zero. This may be demonstrated as follows: 

Given the variable X₁, X₂, . . . , Xₙ, with X̄ = ΣFX/N: 

$$F_1(X_1 - \bar{X}) = F_1X_1 - F_1\bar{X} = F_1x_1$$
$$F_2(X_2 - \bar{X}) = F_2X_2 - F_2\bar{X} = F_2x_2$$
$$\cdots$$
$$F_n(X_n - \bar{X}) = F_nX_n - F_n\bar{X} = F_nx_n$$
$$\sum FX - N\bar{X} = \sum Fx \quad \text{(by adding)}$$

The small x is used regularly to refer to the deviations of the 
variable from the arithmetic mean. 

When added, ΣFX̄ becomes NX̄ because X̄ is constant and 
ΣF is equal to N. 

If the value of X̄ given in Eq. (1) is substituted in 

$$\sum FX - N\bar{X} = \sum Fx$$

it becomes 

$$\sum FX - N \, \frac{\sum FX}{N} = \sum Fx$$

By canceling the N, this becomes 

$$\sum FX - \sum FX = \sum Fx$$

and hence 

$$\sum Fx = 0 \tag{2}$$
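
Equation (2) can be checked on the frequency distribution used earlier. This is a small sketch, not part of the original derivation; exact fractions are used so that rounding does not obscure the zero, and the variable names are illustrative.

```python
from fractions import Fraction

X = [2, 3, 4, 5, 6, 7]
F = [3, 3, 6, 9, 6, 3]

N = sum(F)
mean = Fraction(sum(f * x for x, f in zip(X, F)), N)  # Eq. (1), kept as an exact fraction

# Sum of F times the deviation x = X - mean, as in Eq. (2).
weighted_deviation_sum = sum(f * (x - mean) for x, f in zip(X, F))

print(mean, weighted_deviation_sum)  # 47/10 0
```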


The Median and the Quartiles. In its original simplicity and 
by definition the median is not a mathematical concept like the 
arithmetic mean. On the contrary, the median is a position 
average. By definition, the median is that value than which there 
is an equal number of cases larger and smaller. When the cases 
are arranged in an array, the median is either the value of the 
middle one (when there is an odd number) or some value between 
the two middle ones (when there is an even number). Nor- 
mally, in the latter instance, the arithmetic mean of the two 
middle cases is taken as the median value. To illustrate from a 
very simple example with an odd number of cases: 


X 
1 
2 
3 
6 
7 
8 
9 

X₄, or 6, is the median by definition, because it is the middle 
one in the array. Mi = 6 (Mi is the conventional symbol for 
median). 

It is to be noted that X̄ = 5.143. 

In this illustration it is seen that 1, the first case, is 5 smaller 
than the median, while 9, the last case, is only 3 larger than the 
median. This preponderance of smallness of the variable 
results in an arithmetic mean smaller than the median. By 
definition, the arithmetic mean is affected by every variation and 
consequently by extreme variations. It is affected by the size 
and the number of cases above and below it. The median, 
on the other hand, by definition, is not affected by the size of 
the cases above or below it. 

When the frequencies vary, the median may be found as in 
the following illustration: 

 X      F    Cumulated F 
 1      2         2 
 2      4         6 
 3      5        11 
 4      8        19 
 5      7        26 
 6      3        29 
 7      2        31 
       31 

Thus, there are 2 cases where X = 1; there are 4 cases where 
X = 2; etc. In all, there are 31 cases (ΣF = 31 = N), and the 
middle one is the sixteenth. Mi = 4. That the median is 4 is 
quickly seen by the examination of the cumulated frequencies in 
the third column. This is equivalent to taking the median equal 
to the (N + 1)/2th case, a procedure often adopted when dealing 
with ungrouped data.¹ 

In general terms, the first quartile Q₁ is that value below which 
one-fourth of the cases fall and above which three-fourths of the 
cases fall. Similarly, the third quartile Q₃ is that value below 
which three-quarters of the cases fall and above which one-fourth 
of the cases fall. The median is the second quartile, or Q₂. In 
the above frequency distribution, N/4 = 7¾ and 3N/4 = 23¼. 
Q₁ should thus be some value below which 7¾ cases fall and above 
which 23¼ cases fall. For this distribution, it happens that the 
seventh and eighth cases are identical, and it therefore follows 
that the value of Q₁ is the common value of the seventh and eighth 
cases, or Q₁ = 3. If the seventh and eighth cases had not been 
the same, then Q₁ could be taken as a value between the values 
of the seventh and the eighth case to be found by interpolation. 

For ungrouped data, it is recommended that a uniform and 
systematic method of interpolation be adopted, as follows:² 

¹ When the data are grouped, it is simplest to find the median by 
interpolation within the class interval for the N/2th case. This method is 
described and illustrated in the next chapter. See pp. 218-220. 

² For the method of interpolation when the data are grouped, see pp. 
218-220. 

Take Q₁ as the (N/4 + ½)th case, Q₂, or Mi, as the (N/2 + ½)th 
case, and Q₃ as the (3N/4 + ½)th case. Consider, for example, the 
cases 6, 9, 11, 16, 25, 31, 38, 43, 45, and 49. The (N/4 + ½)th case 
would be the (10/4 + ½) or third case; i.e., Q₁ = 11. The median 
would be the (10/2 + ½) or 5½th case. Since there is no 5½th case, 
however, but only a fifth case, 25, and a sixth case, 31, the median 
is taken as the value that lies just halfway between the fifth and 
sixth cases, i.e., Mi = 25 + (31 - 25)/2 = 28. The third quartile 
would be the (30/4 + ½) or eighth case; i.e., Q₃ = 43. 

As another illustration, suppose 51 is added to the set of 
numbers, making a total of eleven instead of ten numbers. Then 
the first quartile would be the (11/4 + ½) or the 3¼th case. But 
there is no 3¼th case, only a third case, 11, and a fourth case, 16. 
Hence, Q₁ is taken as the value that lies one-fourth of the distance 
between the third and fourth cases. The difference between the 
third and fourth cases is 16 - 11 = 5, and Q₁ is therefore taken 
as 11 + 5/4 = 12¼. Similarly, Mi is the (11/2 + ½) or sixth case, 
which is 31. 
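
The interpolation rule just described can be written out as a short routine. The sketch below is one possible reading of the rule, with linear interpolation between neighbouring cases; the names `case_at` and `quartiles` are illustrative and not from the text.

```python
def case_at(cases, position):
    """Value of the position-th case (counting from 1) in a sorted list.

    A fractional position is resolved by linear interpolation between
    the two neighbouring cases.
    """
    lower = int(position)              # whole part: the case just below
    fraction = position - lower
    value = cases[lower - 1]
    if fraction > 0:
        value += fraction * (cases[lower] - cases[lower - 1])
    return value

def quartiles(cases):
    """Q1, Q2 (the median) and Q3, taken as the (kN/4 + 1/2)th cases."""
    cases = sorted(cases)
    n = len(cases)
    return tuple(case_at(cases, k * n / 4 + 0.5) for k in (1, 2, 3))

ten = [6, 9, 11, 16, 25, 31, 38, 43, 45, 49]
print(quartiles(ten))           # (11, 28.0, 43)
print(quartiles(ten + [51]))    # (12.25, 31, 44.5)
```

Applied to the 31 cases of the frequency table above (each X repeated F times), the same routine gives Q₁ = 3 and Mi = 4, in agreement with the values found from the cumulated frequencies.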

The Mode. As in the case of the median, the mode is not a 
mathematical concept. Moreover, it is not a “position average.” 
The mode is an average that is described in terms of relative 
frequency of occurrence. It is defined as the magnitude that 
occurs more frequently than any other. The mode is the most 
probable magnitude and might be considered a “probability 
average,” because it is often thought of in terms of probabilities. 
It may be illustrated as follows: 

 X      F 
 2      1 
 3      2 
 4      4 
 6      7 
 7      8 
 8      5 
 9      4 
10      3 
       34 

By definition the mode (Mo) is 7, because this value occurs 
more frequently than any of the others. The probability of the 
mode is 8/34 and is greater than the probability of any other value 
of X. 

It is to be noted that the X̄ of this example is 6.706 and the 
median is 7. The mode is not affected by the size of the cases 
above or below it, nor is it affected by the number of cases 
above or below it, within certain rational limits. For example, 
in the illustration, two magnitudes (X = 8) could be added 
to the distribution and the mode would remain 7, as before; 
but if five magnitudes (X = 8) were added to the distribution, 
then 8 would become the mode, as its frequency would then 
be 10. 

It has been established that, when the distribution is only 
moderately skewed, the mode can be estimated from the mean 
and median by the following approximate formula: 

$$\text{Mo} = \bar{X} - 3(\bar{X} - \text{Mi}) \tag{3}$$

Accordingly, the mode may be estimated if the mean and 
the median have been calculated. In actual problems involving 
grouped data the mean and the median are both more accurately 
determinable than is the mode, and for this reason the above 
formula often gives more satisfactory results than any convenient 
direct procedure. This is called the “mathematical mode.” 
It should be emphasized that the formula should not be used, 
however, if the distribution is very skewed. 
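
Both the direct definition of the mode and the approximate formula (3) are easy to try on the frequency table given above. The sketch below assumes the last frequency in that table is 3, the value implied by the total of 34; the names `mode_of` and `estimated_mode` are illustrative.

```python
def mode_of(values, freqs):
    """The magnitude that occurs more frequently than any other."""
    return max(zip(values, freqs), key=lambda pair: pair[1])[0]

def estimated_mode(mean, median):
    """Approximate mode from the mean and the median, as in Eq. (3)."""
    return mean - 3 * (mean - median)

X = [2, 3, 4, 6, 7, 8, 9, 10]
F = [1, 2, 4, 7, 8, 5, 4, 3]   # last frequency assumed to be 3 (total 34)

mean = sum(f * x for x, f in zip(X, F)) / sum(F)   # about 6.706
print(mode_of(X, F))                               # 7
print(round(estimated_mode(mean, 7), 2))           # 7.59
```

The estimate differs somewhat from the observed mode of 7, as is to be expected from an approximate formula applied to so small a distribution.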

The Geometric Mean. The geometric mean (G.M.) is a 
mathematical concept and is defined as the nth root of the product 
of n variables X. Accordingly, the geometric mean of 5, 8, 
and 25 is (5 × 8 × 25)^(1/3) = 10. The geometric mean may 
also be defined as the antilogarithm of the arithmetic mean of 
the logarithms of the variable X, i.e., 

$$\log \text{G.M.} = \frac{\sum \log X}{N} \tag{4}$$

This may be illustrated as follows: 

      log 5  = 0.69897 
      log 8  = 0.90309 
      log 25 = 1.39794 
     Σ log X = 3.00000 
 (Σ log X)/3 = 1.00000 

The antilogarithm of 1.00000 is 10; hence, G.M. = 10. 

Just as the arithmetic mean balances the aggregate deviations, 
so the geometric mean balances the ratios of the variations; 
that is, 

$$\frac{X_1}{\text{G.M.}} \times \frac{X_2}{\text{G.M.}} \times \cdots \times \frac{X_n}{\text{G.M.}} = 1 \tag{5}$$

For, by taking logarithms, this expression becomes 

$$\log X_1 + \log X_2 + \log X_3 + \cdots + \log X_n - N \log \text{G.M.} = 0$$

or 

$$\sum \log X - N \log \text{G.M.} = 0$$

But from Eq. (4), 

$$N \log \text{G.M.} = \sum \log X$$

and hence the expression is shown to be true. 
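
The logarithmic route of Eq. (4) and the balancing property of Eq. (5) can both be tried on the values 5, 8, and 25. This is a minimal sketch; the function name `geometric_mean` is illustrative, and common (base-10) logarithms are used to mirror the worked example.

```python
import math

def geometric_mean(values):
    """Antilogarithm of the arithmetic mean of the common logarithms, Eq. (4)."""
    mean_log = sum(math.log10(x) for x in values) / len(values)
    return 10 ** mean_log

values = [5, 8, 25]
gm = geometric_mean(values)
print(round(gm, 4))  # 10.0

# Eq. (5): the product of the ratios X / G.M. should be unity.
ratio_product = math.prod(x / gm for x in values)
print(round(ratio_product, 6))  # 1.0
```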

In some types of problems the geometric mean gives more 
satisfactory results than the arithmetic mean. For example, it 
is necessary to use the geometric mean to average percentage 
increases of population over successive years or decades or to 
average percentage changes in income, production, and the like.¹ 
Thus, in the column marked X of the following table the 
estimated national income of each year is expressed as a percentage 
of the preceding year: 

        Estimated national       Increase in national 
        income produced in       income expressed as 
        the United States,*      percentage of previous 
Year    billions of dollars      year, X 
1933          47.3 
1934          54.6                 115.43 
1935          59.2                 108.42 
1936          68.9                 116.39 
1937          73.1                 106.10 
1938          67.0                  91.66 
1939          69.7                 104.03 

* Survey of Current Business, Vol. 20 (March, 1940), p. 19; (April, 1940), p. 11. 

¹ Chaddock, R. E., Principles and Methods of Statistics (1925), pp. 126-127; 
Croxton, F. E., and D. J. Cowden, Applied Business Statistics (1939), 
pp. 225-226. 

If the average annual percentage increase is obtained by 
calculating the arithmetic average, the answer obtained is 
642.03/6 = 107.01, which represents an average annual percentage 
increase of 7.01 per cent. Now, if 7.01 is used as a constant 
annual percentage increase from 1933, the following figures would 
be obtained: 


        Constant 7.01 Percentage 
Year    Increase from 47.3 in 1933 
1934    107.01 per cent of 47.3  = 50.62 
1935    107.01 per cent of 50.62 = 54.17 
1936    107.01 per cent of 54.17 = 57.97 
1937    107.01 per cent of 57.97 = 62.03 
1938    107.01 per cent of 62.03 = 66.38 
1939    107.01 per cent of 66.38 = 71.03 

But in 1939 the actual figure, as shown in the preceding table, 
was 69.7; and the average percentage yearly increase could 
not have been so large as 7.01. To obtain the correct percentage 
increase, the geometric and not the arithmetic mean 
should be calculated in this instance. Following the formula 
given above for the geometric mean, it is calculated for this 
problem as follows: 


        Estimated national 
        income produced in       Logarithm of the 
        the United States,       percentage increase, 
Year    billions of dollars      log X 
1933          47.3 
1934          54.6                 2.06232 
1935          59.2                 2.03511 
1936          68.9                 2.06591 
1937          73.1                 2.02572 
1938          67.0                 1.96218 
1939          69.7                 2.01716 

If the average annual percentage increase is now obtained 
by calculating the geometric mean of the rates of increase, 
by first taking the arithmetic mean of the logarithms 

$$\frac{12.16840}{6} = 2.02807$$

and then taking the antilogarithm (antilogarithm of 2.02807), 
the answer obtained is 106.68, or an average annual percentage 
increase of 6.68 per cent. If 6.68 is assumed to be the average 
annual percentage increase since 1933, the following figures would 
be obtained: 

        Constant 6.68 Percentage 
Year    Increase from 47.3 in 1933 
1934    106.68 per cent of 47.3  = 50.46 
1935    106.68 per cent of 50.46 = 53.83 
1936    106.68 per cent of 53.83 = 57.43 
1937    106.68 per cent of 57.43 = 61.27 
1938    106.68 per cent of 61.27 = 65.36 
1939    106.68 per cent of 65.36 = 69.73 

In 1939 the actual figure, as shown above, was 69.7, to which 
69.73 is a close approximation; and hence the average annual 
percentage increase apparently was in fact close to 6.68. 
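
The comparison of the two averages can be repeated by machine. The sketch below works from the national income figures in the table and carries full precision throughout, so its results may differ in the last digit from the text, which uses percentages rounded to two places and five-place logarithms. The names `income`, `ratios`, and `project` are illustrative.

```python
import math

income = [47.3, 54.6, 59.2, 68.9, 73.1, 67.0, 69.7]   # 1933 to 1939, billions of dollars

# Each year expressed as a ratio to the preceding year.
ratios = [later / earlier for earlier, later in zip(income, income[1:])]

arithmetic = sum(ratios) / len(ratios)
geometric = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def project(start, rate, years):
    """Value reached from `start` after growing at a constant `rate` for `years` years."""
    return start * rate ** years

print(round(100 * arithmetic, 2), round(100 * geometric, 2))
print(round(project(47.3, arithmetic, 6), 2), round(project(47.3, geometric, 6), 2))
```

Growth at the geometric-mean rate returns the actual 1939 figure, whereas growth at the arithmetic-mean rate overshoots it, which is the point of the illustration.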

The Harmonic Mean. The harmonic mean (H.M.) is the 
reciprocal of the average of the reciprocals of observations of the 
variable X, thus: 

$$\text{H.M.} = \frac{N}{\sum \frac{1}{X}} \tag{6}$$

Accordingly, the harmonic mean of 5, 8, and 25 would be found 
as follows: 

From a table of reciprocals or by calculation, the reciprocals 
of 5, 8, and 25 are determined (0.200, 0.125, and 0.040), and 
hence the harmonic mean, by definition, is 

$$\frac{3}{0.200 + 0.125 + 0.040} = \frac{3}{0.365} = 8.22$$

The geometric mean of these three numbers, as discovered above, 
is 10; the arithmetic mean is 5 + 8 + 25 = 38 divided by 3, 
or 12.67. It is thus seen that the arithmetic mean is the 
largest, the geometric mean next, and the harmonic mean the 
smallest of these three averages. For positive values that are 
not all equal, it is always true that¹ 

$$\text{H.M.} < \text{G.M.} < \bar{X} \tag{7}$$
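
The three averages of 5, 8, and 25 can be computed side by side to confirm the ordering in Eq. (7). A minimal sketch; the function name `harmonic_mean` is illustrative.

```python
import math

def harmonic_mean(values):
    """Reciprocal of the average of the reciprocals, Eq. (6)."""
    return len(values) / sum(1 / x for x in values)

values = [5, 8, 25]
hm = harmonic_mean(values)
gm = math.prod(values) ** (1 / len(values))
am = sum(values) / len(values)

print(round(hm, 2), round(gm, 2), round(am, 2))  # 8.22 10.0 12.67
```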

The usefulness of the harmonic mean arises in connection with 
certain types of problems in which variable quantities of one 
variable are compared with a constant quantity of another. 
For illustration, speeds may be looked upon as variable numbers 
of miles per minute (a constant quantity of time) or as variable 
amounts of time required to cover a given distance. Similarly, 
prices may be looked upon as variable amounts of money per 
unit of goods sold (a constant quantity of goods) or as variable 
amounts of goods that can be purchased with $1. In many 
such problems the choice of the variable for which the quantity 
is always constant is optional, depending upon the type of 
information it is desired to emphasize. There is a nice 
distinction between the mean and the harmonic mean wherever 
such interchangeability is present. This will be illustrated 
by examples. 

¹ For a proof see G. R. Davies and W. F. Crowder, Methods of Statistical 
Analysis in the Social Sciences, p. 313. 

The accompanying table shows data on prices of corn. 

Wholesale Price of No. 3 Yellow Corn 

Year    Dollars per Bushel 
1913          0.61 
1919          1.59 
1929          0.93 
1939          0.50 

Source: Survey of Current Business, April, 1940, p. 18. 

In the table the amount of money varies, but the quantity 
of corn is constant. The average price per bushel may be 
calculated directly from this table, thus: 3.63/4 = 0.9075. In order 
to obtain the harmonic mean, the reciprocals of these prices must 
first be calculated. 

Wholesale Price of No. 3 Yellow Corn 

Year    Bushels per Dollar 
1913          1.64 
1919          0.63 
1929          1.08 
1939          2.00 

The average of these reciprocals must be computed, thus: 
5.35/4 = 1.3375. 

The reciprocal of the latter number must be obtained, namely, 
0.74766. This last number is the harmonic mean of the prices 
expressed in dollars per bushel. The harmonic mean, therefore, 
of the prices per bushel of No. 3 yellow corn is approximately 
75 cents; and it is the price per bushel of the average amount of 
No. 3 yellow corn that could have been purchased for $1, or, 
in other words, it is the reciprocal of the average amount of No. 3 
yellow corn that could have been purchased for $1. If the 
reciprocal of the mean price, 0.9075, is taken, a figure of quite 
different significance is obtained, namely, 1.102. This reciprocal 
is the average number of bushels that could have been bought at 
the mean price. 
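
The corn-price figures can be retraced as follows. This sketch keeps the reciprocals unrounded, so the harmonic mean comes out near 0.7486 rather than the 0.74766 obtained in the text from reciprocals rounded to two places; both round to about 75 cents. The variable names are illustrative.

```python
prices = {1913: 0.61, 1919: 1.59, 1929: 0.93, 1939: 0.50}   # dollars per bushel

mean_price = sum(prices.values()) / len(prices)

# Reciprocals: bushels that $1 would buy in each year.
bushels_per_dollar = [1 / p for p in prices.values()]
mean_bushels = sum(bushels_per_dollar) / len(bushels_per_dollar)

harmonic_price = 1 / mean_bushels        # harmonic mean of the prices
bushels_at_mean_price = 1 / mean_price   # a different quantity altogether

print(round(mean_price, 4), round(harmonic_price, 2), round(bushels_at_mean_price, 3))
# 0.9075 0.75 1.102
```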

In deciding whether to use the arithmetic or the harmonic 
mean in any given problem, it should be determined which 
magnitude should be regarded as the constant (for example, the 
amount of corn bought or the amount of money spent), a matter 
that can usually be decided without difficulty in a practical 
problem. If the data are recorded with the appropriate quantity 
constant, the arithmetic mean may be used. If the appropriate 
item is made a variable as the data are tabulated, the harmonic 
mean must be used. 

Another illustration will serve to clarify further the use of 
the harmonic mean. The efficiency of a fighting airplane may 
be determined, in part at least, by its speed, which can be 
expressed either as the number of miles flown per minute or the 
amount of time required to fly a mile. Following are the results 
of tests of a plane under trial: 

Results of Tests 

Miles per minute 6, 4, 7, 6, 5 

Is the significant measure the rate at which a plane flies or 
the amount of time required to fly a number of miles? If it is 
admitted that the rate at which the plane flies is the important 
consideration (that is, the number of minutes required to fly a 
mile), the harmonic mean is the relevant measure, inasmuch 
as the recorded data make the time element constant and not the 
distance flown. The arithmetic mean, if calculated, would not 
be lacking in significance, but its reciprocal should not be com- 
pared with rate measures in which the number of miles is constant 
and the time varies. The average number of miles per minute is 
28/5 = 5.6. The reciprocal of this number, 0.17857, is not the 
harmonic mean, and it is not the average time that it takes to 
travel a mile. On the contrary, 0.17857 minute is the amount 
of time it requires to go a mile when traveling at the average 
number of miles per minute. The average amount of time 
required to travel a mile is a different thing, namely, the average 
of 1/6, 1/4, 1/7, 1/6, and 1/5 minute, or 0.185 minute. While the two 
results are close together in value, it would make a large difference 
in calculations having to do with hours of time if the arithmetic 
mean were used when the harmonic mean ought to have been 
used. 
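
The distinction drawn in the airplane illustration can be seen directly from the five test results. A minimal sketch with illustrative variable names:

```python
speeds = [6, 4, 7, 6, 5]   # miles per minute, from the tests above

mean_speed = sum(speeds) / len(speeds)
time_at_mean_speed = 1 / mean_speed                      # minutes per mile at the mean speed
mean_time = sum(1 / s for s in speeds) / len(speeds)     # average minutes per mile
harmonic_speed = 1 / mean_time                           # harmonic mean of the speeds

print(round(mean_speed, 1), round(time_at_mean_speed, 5))  # 5.6 0.17857
print(round(mean_time, 3), round(harmonic_speed, 3))       # 0.185 5.398
```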

The Concept of an Average as a Summary Figure. The general 
significance of an average as a summary figure may be illustrated. 
Suppose that information concerning the heights of mature males 
in Newark, N.J., is desired. Heights of all the mature males 
in Newark are therefore measured to the nearest ^ inch. The 
data collected will constitute complete information about the 
heights of mature males in Newark. But this knowledge, in 
untabulated form without summary figures, is not easy to com- 
prehend. It is necessary to analyze this total knowledge in 
some manner so that it may become more significant. The 
manner in which the analysis will proceed depends upon the 
object in mind; for example, an answer to any one of the following 
questions may give a more significant view: 

1. What height will coincide with the greatest number of 
recorded observations? The answer to this question is, of 
course, the mode. 

2. What is the height such that greater and smaller heights 
have been recorded with equal frequency? The answer to this 
question is the median. 

3. What is the height such that the sum of the squares of the 
differences between it and the recorded observations is a mini- 
mum? Or what is the height such that the algebraic sum of 
the differences between it and the recorded observations is zero? 
The answer to this question will be the arithmetic mean. 

4. What is the height H such that the product of the ratios of 
the recorded observations to H is unity? The answer to this 
question will be the geometric mean. 

5. If several rates of speed were given in miles per second, how 
many seconds on the average will be required to travel 1 mile? 
The answer is the harmonic mean. (Heights could not be used 
as an illustration because the harmonic mean is significant only 
in the dual-variable type of quantity expression, as explained 
above.) 

The term “average” is a generic term, and any one of these 
summary figures may be called an average; the decision as to 
which average should be used depends upon what question is to 
be answered. If the median height is known and the height of a 
man is greater than the median, it can be inferred that the man is 
taller than most men. If a man's height is equal to the mode, 
it is known that he has the most common, or usual, height. If 
height is analyzed in the abstract, as it might be in research on 
the effects of heredity and environment, the arithmetic mean is 
likely to be used, for such analyses ordinarily involve the solution 
of problems in mathematical terms. 

MOMENTS 

Definition. In physics, “moment” is a measure of a force with 
respect to its tendency to produce rotation. The strength of 
the tendency depends on the amount of force and the distance 
from the origin of the point at which the force is exerted. If a 
number of forces, F₁, F₂, . . . , Fₙ, at distances X₁, X₂, . . . , Xₙ, 
are applied, the moment of the first force about the origin is 
F₁X₁, the moment of the second force is F₂X₂, etc. These 
moments are additive so that ΣFX is the total moment about 
the origin. If the total moment is divided by the total force, 
the quotient is termed “a moment coefficient.” The formula is 
ΣFX/N, where N = ΣF is the total force. 

It will be recognized that the formula for a moment coefficient 
is identical with that for an arithmetic mean. This identity 
has led statisticians to speak of the arithmetic mean as the 
“first moment about the origin.” Technically the mean is a 
moment coefficient and not a total moment, but in the case of 
frequency curves, with which mathematical statistics is primarily 
concerned, the total frequency N is generally taken as unity,¹ 
so that the total moment and the moment coefficient are identical. 
In any case, it has become customary in statistics to speak of 
the mean X̄ = ΣFX/N as the first moment about the origin, and 
the distinction between total moment and moment coefficient is 
ignored. 

The concept of moments is also extended to higher powers. 
Thus in statistics ΣFX²/N is termed the “second moment 
about the origin,” and ΣFX³/N is called the “third moment 
about the origin,” etc. In general, the moments about zero 
are as follows: 

$$\nu_1 = \frac{\sum FX}{N}, \quad \nu_2 = \frac{\sum FX^2}{N}, \quad \nu_3 = \frac{\sum FX^3}{N}, \quad \ldots, \quad \nu_n = \frac{\sum FX^n}{N} \tag{8}$$

¹ See pp. 276-277 and Appendix, Table VI. 


When the moments are calculated, not from zero, but from 
the mean, or “centroid,” they are defined as follows: 


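$$\mu_1 = \frac{\sum Fx}{N}, \quad \mu_2 = \frac{\sum Fx^2}{N}, \quad \mu_3 = \frac{\sum Fx^3}{N}, \quad \ldots, \quad \mu_n = \frac{\sum Fx^n}{N} \tag{9}$$

where x = X - X̄. It will be noted that it is a convention in 
statistics to use small x to represent deviations from the 
arithmetic mean. 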

The first moment about the mean is the sum of the deviations 
from the mean multiplied by their respective frequencies and 
divided by the number of cases in the frequency distribution. 
The second moment is similarly obtained except that the devi- 
ations are squared before adding. To obtain the third moment 
the deviations are cubed, multiplied by their respective fre- 
quencies, added, and divided by the number of cases; to obtain 
the fourth moment the deviations are raised to the fourth power; 
to obtain the nth moment the deviations are raised to the nth 
power. This is indicated in the above equations. As already 
demonstrated above,¹ the first moment about the mean is equal 
to zero. In mechanics, the square root of the second moment 
is the radius of gyration of a set of 5 equal particles, with respect 
to a given centroidal axis.² 

¹ See pp. 169-170. 

² Cf. Rietz, H. L., Mathematical Statistics (1936), p. 20. 

Purpose of Moments. For statistics the moments serve 
primarily as intermediary values. The moments about the 
mean are intermediary values useful for calculating measures 
of variability, skewness, and other characteristics of the fre- 
quency distribution. Because of their great convenience in 
obtaining measures of the various characteristics of a frequency 
distribution, the calculation of the first four moments about the 
mean may well be made the first step in the analysis of a fre- 
quency distribution. This valuable feature will be illustrated 
in the ensuing pages and in the next chapter. Following are 
important generalizations concerning moments measured from 
the arithmetic mean for all frequency distributions. 

$$\mu_0 = 1$$
$$\mu_1 = 0$$
$$\mu_2 = \sigma^2$$ * 

and in symmetrical distributions, 

$$\mu_3 = 0$$
$$\mu_5 = 0$$
$$\mu_7 = 0$$

for all “odd” numbered moments. 
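
The moments about the origin and about the mean can be computed with two short functions. The sketch below uses exact fractions and reuses two small distributions from this chapter; the function names are illustrative and the divisor is N throughout, as in the definitions above.

```python
from fractions import Fraction

def moment_about_origin(values, freqs, k):
    """k-th moment about zero: sum of F * X**k, divided by N."""
    n = sum(freqs)
    return Fraction(sum(f * x ** k for x, f in zip(values, freqs)), n)

def moment_about_mean(values, freqs, k):
    """k-th moment about the mean: sum of F * x**k, divided by N, with x = X - mean."""
    n = sum(freqs)
    mean = moment_about_origin(values, freqs, 1)
    return sum(f * (x - mean) ** k for x, f in zip(values, freqs)) / n

X = [2, 3, 4, 5, 6, 7]
F = [3, 3, 6, 9, 6, 3]
print(moment_about_origin(X, F, 1))                        # 47/10
print(moment_about_mean(X, F, 1), moment_about_mean(X, F, 2))  # 0 201/100

# A symmetrical distribution: the values 1 to 9, each occurring once.
S = list(range(1, 10))
ones = [1] * 9
print(moment_about_mean(S, ones, 3), moment_about_mean(S, ones, 5))  # 0 0
```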

VARIABILITY 

It was indicated above that the chief interest of statistics is 
in variability; summary figures such as averages are useful as 
points of departure for further study of the frequency distribu- 
tion. It may be noted that the principle of averaging is funda- 
mental throughout; for all the various methods of summarizing, 
whether it be central tendencies or variations from points of 
central tendencies, use the principle of averaging as a method of 
summarization or measurement. 

The Range. The most obvious method of measuring 
variability is to take the difference between the highest value and the 
lowest value; this difference is called the “range.” Thus in a 
set of several hundred grades, if the highest grade is 92, and the 
lowest 13, the range is 92 - 13 = 79. The range is easily 
understood and easy to compute, but it is dependent entirely on 
the two extreme items. It is therefore seldom used as the 
measure of variability when accuracy and stability of results are 
desired. It is better to use some measure of variability that is 
dependent on more than just two, if not all, of the cases.¹ 

* Read “sigma square.” 

The Average Deviation. The average deviation (A.D.) is the 
arithmetic average of the variations of the data from their central 
tendency. This measure of variability may be illustrated as 
follows: 

             Deviations from the Mean 
Variable           (X̄ = 6) 
   X                  x 
   2                 -4 
   3                 -3 
   4                 -2 
   6                  0 
   8                  2 
   9                  3 
  10                  4 
                     +9 
                     -9 
                Σ|x| = 18 

The deviations have to be added without regard to sign; 
otherwise, their sum is zero. In this example there are seven 
deviations (one of which is zero), and hence the average deviation 
is 18/7, or 2.57. 

$$\text{A.D.}_{\bar{X}} = \frac{\sum |x|}{N} \tag{12}$$

where x = X - X̄. 

The average deviation could be measured from the median or the mode,
as well as from the arithmetic mean. In fact, it is usually measured
from the median since it is least when so computed.² Let X - Mi = x'.
Then the formula for the average deviation from the median is

$$\mathrm{A.D.}_{Mi} = \frac{\sum |x'|}{N} \qquad (12a)$$

Mi is used as a subscript to A.D. to indicate that the average
deviation is measured from the median. It is to be noted that the
average deviation is a measure of variability that is based on all the
observed cases.

¹ See, however, Smith and Duncan, Sampling Statistics, for the use of
the range in sampling analysis.

² Cf. Kelley, Truman, Statistical Method (1923), p. 74.





In the foregoing example each X occurred only once, and so the F's
were all unity. When there is more than one of any X, the formulas are

$$\mathrm{A.D.}_{\bar{X}} = \frac{\sum |Fx|}{N} \qquad \mathrm{A.D.}_{Mi} = \frac{\sum |Fx'|}{N}$$
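The average-deviation formulas above can be sketched in a few lines of Python. This is only an illustration: the function name `average_deviation` and its `center` parameter are assumptions made for the example, and the median here is taken from the individual cases rather than by interpolation within a class interval.

```python
def average_deviation(values, freqs=None, center="mean"):
    """Average deviation, sum of F|x| over N, about the mean or the median."""
    if freqs is None:
        freqs = [1] * len(values)          # every F is unity
    n = sum(freqs)
    if center == "mean":
        c = sum(f * x for x, f in zip(values, freqs)) / n
    else:
        # Median of the individual cases (no class-interval interpolation).
        cases = sorted(x for x, f in zip(values, freqs) for _ in range(f))
        mid = n // 2
        c = cases[mid] if n % 2 else (cases[mid - 1] + cases[mid]) / 2
    # Deviations are added without regard to sign.
    return sum(f * abs(x - c) for x, f in zip(values, freqs)) / n


X = [2, 3, 4, 6, 8, 9, 10]
print(round(average_deviation(X), 2))                   # 2.57, i.e. 18/7
print(round(average_deviation(X, center="median"), 2))  # 2.57 (median is also 6)
```

In this small example the mean and the median coincide, so the two versions of the measure agree; in a skewed distribution the value taken about the median would be the smaller of the two.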


Standard Deviation. The most generally used measure of variability,
however, is the standard deviation. This will be readily understood
when it is seen that the standard deviation is easily treated
mathematically; the average deviation has very distinct limitations in
this respect owing to the disregard of plus and minus signs. In the
case of the standard deviation this defect is overcome by squaring the
deviations before they are averaged and then taking the square root of
the average. The symbol for the standard deviation is the small Greek
letter σ, read "sigma." By definition, the standard deviation is the
square root of the average of the squared deviations from the mean.
Symbolically,

$$\sigma = \sqrt{\frac{\sum Fx^2}{N}} \qquad (13)$$

As may be seen by comparing this definition with the definition of the
moments about the mean [Eqs. (9)], the standard deviation is the
square root of the second moment; i.e.,

$$\sigma = \mu_2^{1/2} \qquad (13a)$$

Following is an illustration of the computation of the standard
deviation:

  Variable     Deviations from the mean (X̄ = 5)     Deviations squared
     X                      x                               x²
     1                     -4                               16
     2                     -3                                9
     3                     -2                                4
     4                     -1                                1
     5                      0                                0
     6                      1                                1
     7                      2                                4
     8                      3                                9
     9                      4                               16

                                                      Σx² = 60

In the above illustration the standard deviation is (60/9)^(1/2), or
2.58. It will be noted here that the F's are all unity and therefore
do not enter into the calculations.
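As a rough check on the arithmetic, the definition of the standard deviation can be written out directly. The sketch below assumes the divisor is N, as in the text's definition (not N - 1); the function name `standard_deviation` is made up for the illustration.

```python
import math


def standard_deviation(values, freqs=None):
    """Square root of the average squared deviation from the mean (divisor N)."""
    if freqs is None:
        freqs = [1] * len(values)          # every F is unity
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    second_moment = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
    return math.sqrt(second_moment)


X = list(range(1, 10))                     # the values 1 through 9
print(round(standard_deviation(X), 2))     # 2.58, the square root of 60/9
```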

Variance. The square of the standard deviation is called the
"variance." Since σ² = ΣFx²/N, the variance is merely another
name for the second moment about the mean [see the definition
of this in Eqs. (9)]. The second moment is smaller when calculated
from the mean than when calculated from any other point.¹
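A short sketch of why the mean gives the smallest second moment may be helpful; it is offered only as an outline and is not the proof the text cites. Let a be any other point and write d = X̄ - a, so that X - a = x + d. Then

$$\frac{\sum F(X-a)^2}{N} = \frac{\sum Fx^2}{N} + 2d\,\frac{\sum Fx}{N} + d^2 = \sigma^2 + d^2,$$

because ΣFx = 0. The right-hand side is least when d = 0, that is, when the deviations are measured from the mean itself.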

SKEWNESS 

Definition and Significance. "Skewness" means asymmetry.
Frequency distributions that have extreme variations resulting
in a longer tail in one direction from the peak than in the other
direction from the peak are asymmetrical in appearance. Such
distributions are called "skewed distributions." Up to this
point the discussion has concerned methods for measuring 
central tendencies (averages) and methods for measuring vari- 
ability (average deviation and standard deviation). The 
significance of an average may be considerably modified when 
considered in comparison with the average deviation or standard 
deviation; but, in addition, the significance of the averages 
depends upon the symmetry or lack of symmetry in the dis- 
tribution of individual cases. Measures of skewness are accord- 
ingly desirable. 

Skewness Measured by Relationship between the Mean, the
Median, and the Mode. Skewness is most easily measured by
the relationship between the mean, the median, and the mode; 
for it will be recalled that the mode is not affected by the mag- 
nitudes or number of variations above or below it. The median 
is affected by the number of variations, but not by their mag- 
nitudes. The mean is affected by both the number and the 
magnitudes of the variations from it. Consequently, it would be 
expected that 

1. The mode, by definition, will remain at the point of greatest 
frequency, whether or not distribution is skewed. 

2. The median will be pulled away from the mode in dis- 
tributions that are skewed, since the larger number of items on 
one side pull it from the point of greatest frequency. 


3. The mean in such distributions will be pulled away from the
mode still farther since the larger number and extreme magnitude
of the items on one side pull it farther away from the point of
greatest frequency.

¹ Cf., for proof, p. 216.

These points are illustrated in the following frequency
distributions:†

           1. Symmetrical    2. Skewed Positively    3. Skewed Negatively
    X            F                    F                       F
    1            3                    6                       1
    2            4                    8                       2
    3            6                    9                       3
    4            9                    8                       5
    5           10                    7                       7
    6            9                    5                       9
    7            6                    3                      10
    8            4                    2                       8
    9            3                    1                       6

  Total         54                   49                      51

                (1)                  (2)                     (3)
    X̄            5                   3.92                    6.10
    Mi           5                   3.69                    6.33
    Mo           5                   3                       7
    σ            2.08                2.05                    2.01


Figures 64 to 66 show in graphic form the frequency dis- 
tributions given on this page. The relationship between the 
three averages will be more clearly visualized from these figures. 

In Fig. 64, for example, all three averages equal 5. In Fig. 65,
which is the positively skewed frequency distribution, the three
differ from each other; the mode is 3, the median is 3.69, and the
mean is 3.92. The extreme variations toward the higher values
give the frequency distribution a longer tail to the right, or
toward the higher values of X, and this pulls the median and the
mean in that direction from the mode.

In Fig. 66 a negatively skewed frequency distribution is
illustrated. This histogram is a graph of the third set of figures
shown on this page. In this figure, too, the averages differ
from each other, but here the mode is the largest. The extreme
variations toward the lower values give the frequency distribution
a longer tail to the left, or toward the lower values of X,
and this pulls the median and the mean in that direction from
the mode; so while the mode is 7, the median is 6.33 and the
mean is 6.10.

† Calculations of averages were made on the assumption that the X values
are mid-points of class intervals of unity.



Fig. 64. A symmetrical frequency distribution.

Fig. 65. A positively skewed frequency distribution.


Skewness may accordingly be measured by the difference
between the mean and the mode. For the above examples,

$$\bar{X} - Mo = 0 \qquad \bar{X} - Mo = +0.92 \qquad \bar{X} - Mo = -0.90$$

This is a measure of the aggregate amount of skewness, but
how significant is this amount? One device to weigh this is to
compare the aggregate amount of skewness with the standard
deviation and thereby to obtain a coefficient of skewness, as
follows:

$$sk = \frac{\bar{X} - Mo}{\sigma} \qquad (14)$$

For example (2),

$$sk = \frac{\bar{X} - Mo}{\sigma} = \frac{0.92}{2.05} = 0.45$$

For example (3),

$$sk = \frac{\bar{X} - Mo}{\sigma} = \frac{-0.90}{2.01} = -0.45$$

The relative amount of skewness, or asymmetry, in these two
distributions comes out equal, although one is positive and the
other is negative.
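The averages and coefficients quoted for the three distributions can be reproduced with a short Python sketch. It assumes, as the text's footnote states, that the X values are mid-points of class intervals of unity, and it locates the median by linear interpolation within its class; the function names are hypothetical.

```python
import math


def grouped_summary(mid_values, freqs, width=1.0):
    """Return (mean, median, mode, sigma) for a frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(mid_values, freqs)) / n
    sigma = math.sqrt(
        sum(f * (x - mean) ** 2 for x, f in zip(mid_values, freqs)) / n
    )
    mode = mid_values[freqs.index(max(freqs))]   # mid-point of the fullest class
    cumulative = 0
    median = None
    for x, f in zip(mid_values, freqs):
        if f > 0 and cumulative + f >= n / 2:
            # Interpolate inside the class that contains case number N/2.
            median = (x - width / 2) + (n / 2 - cumulative) / f * width
            break
        cumulative += f
    return mean, median, mode, sigma


def skewness_mean_mode(mid_values, freqs):
    """Coefficient of skewness (mean - mode) / sigma, as in Eq. (14)."""
    mean, _, mode, sigma = grouped_summary(mid_values, freqs)
    return (mean - mode) / sigma


X = list(range(1, 10))
examples = {
    "symmetrical": [3, 4, 6, 9, 10, 9, 6, 4, 3],
    "positive": [6, 8, 9, 8, 7, 5, 3, 2, 1],
    "negative": [1, 2, 3, 5, 7, 9, 10, 8, 6],
}
for name, F in examples.items():
    mean, median, mode, sigma = grouped_summary(X, F)
    sk = skewness_mean_mode(X, F)
    print(name, round(mean, 2), round(median, 2), mode, round(sigma, 2), round(sk, 2))
```

Run on the three sets of frequencies, the sketch gives coefficients of 0, about +0.45, and about -0.45, in agreement with the figures worked out above.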



Fig. 66. A negatively skewed frequency distribution.


Another measure of skewness is based on the median and
the mean. It has been established that in a moderately
asymmetrical distribution, if the mean is pulled a distance P away
from the mode, the median is pulled approximately two-thirds
P away from the mode in the same direction; that is,

$$\bar{X} - Mo = 3(\bar{X} - Mi)$$

Hence, skewness can also be measured by three times the distance
between the mean and the median, as follows:

$$sk = \frac{3(\bar{X} - Mi)}{\sigma} \qquad (14a)$$

The second equation has the advantage over (14) in that the
median is often easier to locate than the mode. The mode is
frequently difficult to locate in a sample distribution and is
subject to wide fluctuations of sampling; in addition, its location
is often dependent merely upon the selection of the class interval.
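As a worked illustration (an added check, using only the averages already tabulated for examples (2) and (3)), the mean-median form gives

$$sk_{(2)} = \frac{3(3.92 - 3.69)}{2.05} \approx 0.34 \qquad sk_{(3)} = \frac{3(6.10 - 6.33)}{2.01} \approx -0.34$$

These values have the same signs as the +0.45 and -0.45 obtained from the mean and the mode but are somewhat smaller, which is to be expected since the relation X̄ - Mo = 3(X̄ - Mi) holds only approximately.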

Skewness Measured by the Median and Quartiles. A simple
measure of skewness, and one that is easily comprehended, is
obtained by comparing the location in a frequency distribution of
its median and quartiles. This can be illustrated by taking the
same three distributions used above and calculating their
respective first and third quartiles (the medians are already
calculated). From Fig. 67 it is seen that in the symmetrical
distribution the first quartile is smaller than the median by the
same amount that the third quartile is larger than the median;
Q1 and Q3 are equidistant from the median. Accordingly,

$$Q_3 - Mi - (Mi - Q_1) = 0$$

If the terms are rearranged, this may be written

$$Q_1 + Q_3 - 2Mi = 0$$

Fig. 67. Relation between the first and third quartiles and the median of a
symmetrical distribution.

From Fig. 68, the positively skewed case, it is seen that the
first quartile is smaller than the median by an amount
considerably less than the amount by which the third quartile is
larger than the median; accordingly,

$$Q_3 - Mi - (Mi - Q_1) = +0.222$$

From Fig. 69, the negatively skewed distribution, it is seen
that the first quartile is smaller than the median by an amount
considerably larger than the amount by which the third quartile is
larger than the median; for

$$Q_3 - Mi - (Mi - Q_1) = -0.154$$



Fig. 68. Relation between the first and third quartiles and the median of a
positively skewed distribution.

Fig. 69. Relation between the first and third quartiles and the median of a
negatively skewed distribution.

If the location of the quartiles compared with the location
of the median, in each distribution, is now compared with half
the distance between the two quartiles (i.e., the average distance
of the two quartiles from the median), a coefficient or
relative measure of skewness is obtained. The formula that
measures skewness is as follows:

$$sk = \frac{Q_3 - Mi - (Mi - Q_1)}{\frac{Q_3 - Q_1}{2}} = \frac{Q_3 + Q_1 - 2Mi}{Q} \qquad (15)$$

where Q is used as a symbol signifying the semiquartile range.

Not only is the coefficient of skewness based upon the median
and quartiles a good one because these are usually easy to find,
but also this coefficient of skewness has the advantage that it
has value limits between +2 and -2. It can be no greater
than +2, a value reached when positive skewness is so great that the
median equals the first quartile. It can be no less than -2,
a value reached when negative skewness is so great that the median
equals the third quartile. In Figs. 68 and 69, respectively,
the coefficients of skewness are +0.146 and -0.106 expressed
as ratios. Expressed as percentages of skewness, they are
+14.6 per cent and -10.6 per cent.
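The two limits can be verified in one line each, taking Q to be the semiquartile range (Q3 - Q1)/2 as stated. If the median coincides with the first quartile, Mi = Q1, then

$$sk = \frac{Q_3 + Q_1 - 2Q_1}{(Q_3 - Q_1)/2} = \frac{Q_3 - Q_1}{(Q_3 - Q_1)/2} = +2,$$

and if the median coincides with the third quartile, Mi = Q3, then

$$sk = \frac{Q_3 + Q_1 - 2Q_3}{(Q_3 - Q_1)/2} = \frac{-(Q_3 - Q_1)}{(Q_3 - Q_1)/2} = -2.$$

Since the median can never lie outside the two quartiles, no value beyond these limits can arise.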

Third Moment as a Measure of Skewness. The cube root of
the third moment is also a good measure of skewness. This is
due to the fact that (1) if the distribution is symmetrical, ΣFx³
will be zero; but (2) if the distribution is not symmetrical, the
third moment will not be equal to zero but will have a positive
or negative value according to whether the distribution is
skewed positively or negatively. This is illustrated by simple
examples showing a symmetrical and a positively skewed
distribution. The negatively skewed distribution is left to the
student to work out.


1. Symmetrical Distribution

    X       F       x       Fx      Fx²     Fx³
    1       2      -3       -6      18      -54
    2       3      -2       -6      12      -24
    3       4      -1       -4       4       -4
    4       5       0        0       0        0
    5       4       1        4       4        4
    6       3       2        6      12       24
    7       2       3        6      18       54

  N = 23         ΣFx = 0    ΣFx² = 68    ΣFx³ = 0

2. Positively Skewed Distribution

    X       F       x       Fx      Fx²     Fx³
    1       6      -2      -12      24      -48
    2       8      -1       -8       8       -8
    3      10       0        0       0        0
    4       4       1        4       4        4
    5       3       2        6      12       24
    6       2       3        6      18       54
    7       1       4        4      16       64

  N = 34         ΣFx = 0    ΣFx² = 82    ΣFx³ = 90


In the second example, the third moment is equal, not to
zero, but to 90/34. The measure of skewness by this method
would be the cube root of 90/34, or about 1.38. Symbolically, this
measure of skewness is

$$sk = \sqrt[3]{\frac{\sum Fx^3}{N}} = \mu_3^{1/3} \qquad (16)$$

It may be seen from the definition of the third moment [Eqs.
(9)] that this measurement of skewness is the cube root of the
third moment. Expressed as a coefficient of skewness, where
the aggregate amount of skewness is in terms of the standard
deviation, this measure of skewness is as follows:

$$sk = \frac{\mu_3^{1/3}}{\sigma}$$
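The third-moment measure can be checked against the two worked tables with a brief sketch. The helper names below are assumptions for illustration; the cube root is taken with the sign of the third moment so that negative skewness would yield a negative result.

```python
import math


def central_moment(values, freqs, k):
    """k-th moment about the arithmetic mean: sum of F * x**k over N."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * (x - mean) ** k for x, f in zip(values, freqs)) / n


def third_moment_skewness(values, freqs):
    """Return (cube root of mu_3, the same divided by sigma)."""
    mu2 = central_moment(values, freqs, 2)
    mu3 = central_moment(values, freqs, 3)
    root = math.copysign(abs(mu3) ** (1 / 3), mu3)   # signed cube root
    return root, root / math.sqrt(mu2)


X = [1, 2, 3, 4, 5, 6, 7]
symmetrical = [2, 3, 4, 5, 4, 3, 2]
positive = [6, 8, 10, 4, 3, 2, 1]

print(central_moment(X, symmetrical, 3))        # 0.0
absolute, relative = third_moment_skewness(X, positive)
print(round(absolute, 2), round(relative, 2))   # 1.38 0.89
```

For the positively skewed table the sketch gives a third moment of 90/34, a cube root of about 1.38, and, after division by σ (the square root of 82/34), a coefficient of about 0.89.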



The Beta Coefficients. This last measure of skewness is of
particular interest, not only because it is based on a wholly
mathematical procedure (it is not dependent on nonmathematical
summaries like the median and quartiles or the mode),
but also because it is directly related to one of the so-called
"beta coefficients." The beta coefficients are functions of the
moments of the frequency distribution that have been found
very useful in describing and distinguishing various types of
frequency distributions.¹ The two principal beta coefficients
are β₁ and β₂, which are defined as follows:

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} \qquad \beta_2 = \frac{\mu_4}{\mu_2^2}$$

It will be noted that the sixth root of β₁ is identically the
coefficient of skewness, sk = μ₃^(1/3)/σ, for μ₂ = σ². Frequently √β₁
is itself taken as a measure of skewness, the square root being given
the same sign as ΣFx³ from which the third moment is calculated.

¹ Smith and Duncan, Sampling Statistics.

When the beta coefficients are used to describe a frequency
curve according to a formula developed by Karl Pearson,¹ the
skewness for this curve as measured by (X̄ - Mo)/σ is found to be
equal to

$$sk = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)} \qquad (17)$$

When the data are such as to warrant the fitting of a smooth
frequency curve, this is an excellent formula for measuring
skewness. The curve, of course, does not have to be fitted to
make use of the formula as a measure of skewness.

When β₁ is small, i.e., when the skewness is slight, and when β₂
is approximately equal to 3, as it is in the case of a normal
distribution,² this last equation shows that √β₁ is approximately
equal to twice (X̄ - Mo)/σ, the mode being that of the fitted
Pearsonian curve. When the latter is calculated by the approximate
formula Mo = X̄ - 3(X̄ - Mi), the calculation of half
the square root of β₁ will in certain cases serve as a rough check
on the computation of skewness from the formula sk = 3(X̄ - Mi)/σ.
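The beta coefficients and formula (17) lend themselves to a short numerical sketch. The function names are hypothetical, and the data are the positively skewed table used earlier for the third moment; the sketch is meant only to show the relations described above, not to reproduce any figure from the text.

```python
import math


def central_moment(values, freqs, k):
    """k-th moment about the arithmetic mean: sum of F * x**k over N."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * (x - mean) ** k for x, f in zip(values, freqs)) / n


def beta_coefficients(values, freqs):
    """Return (beta_1, beta_2) = (mu_3^2 / mu_2^3, mu_4 / mu_2^2)."""
    mu2 = central_moment(values, freqs, 2)
    mu3 = central_moment(values, freqs, 3)
    mu4 = central_moment(values, freqs, 4)
    return mu3 ** 2 / mu2 ** 3, mu4 / mu2 ** 2


def pearson_skewness(beta1, beta2, sign=1.0):
    """Formula (17); `sign` carries the sign of the third moment."""
    return sign * math.sqrt(beta1) * (beta2 + 3) / (2 * (5 * beta2 - 6 * beta1 - 9))


X = [1, 2, 3, 4, 5, 6, 7]
F = [6, 8, 10, 4, 3, 2, 1]
beta1, beta2 = beta_coefficients(X, F)
print(round(beta1, 2), round(beta2, 2))

# The sixth root of beta_1 equals the cube root of mu_3 divided by sigma.
mu2 = central_moment(X, F, 2)
mu3 = central_moment(X, F, 3)
print(beta1 ** (1 / 6), mu3 ** (1 / 3) / math.sqrt(mu2))

# With beta_2 = 3 and beta_1 small, formula (17) is close to half of sqrt(beta_1).
print(pearson_skewness(0.01, 3.0), math.sqrt(0.01) / 2)
```

The last line illustrates the rough check mentioned above: for a slightly skewed curve with β₂ near 3, the value from formula (17) and half the square root of β₁ differ only in the third decimal place.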


The importance of the second beta coefficient lies in the fact 
that it is a measure of kurtosis. 


KURTOSIS 

Definition. Kurtosis is described by Karl Pearson as follows:³

Given two frequency distributions which have the same variability
as measured by the standard deviation, they may be relatively more or
less flat-topped than the normal curve. ... A frequency distribution
may, in other words, be symmetrical, but it may fail to be mesokurtic
(equally flat-topped with the normal curve), and thus the Gaussian
curve cannot describe it.

¹ Cf. Smith and Duncan, Sampling Statistics.

² See the next section.

³ "Skew Variation, A Rejoinder," Biometrika, Vol. 4 (1906), p. 173.
Cited from H. M. Walker, Studies in the History of Statistical Method, p. 182.

The "normal curve" to which this quotation refers is represented
by the equation

$$Y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X - \bar{X})^2}{2\sigma^2}} \qquad (18)$$


Because this curve has arisen so frequently in statistics and
because it has been used as a type with which to compare other
frequency curves, it has come to be known as the normal curve.
Also, since Gauss early recognized its importance, it is sometimes
called the Gaussian curve.¹

Fig. 70. Frequency distributions with greater and with less kurtosis than the
normal curve.

As shown in graphic form earlier in this chapter (Fig. 63),
when there is a marked concentration of very small variations
about the central tendency, the frequency curve rises to a high
peak, unlike the normal, or Gaussian, curve, which has a certain
roundness at the top. Kurtosis is a measure that makes it
possible to describe the relative degree to which this characteristic
exists with reference to any frequency distribution. For
the normal curve, the relationship between the second and the
fourth moments is as follows:

$$\frac{\mu_4}{\mu_2^2} = 3$$

¹ Cf. pp. 294-296.

If the ratio of the fourth moment to the square of the second
moment is less than 3, the curve is flatter than the normal curve;
and if the ratio is greater than 3, the curve is more peaked than
the normal curve. Figure 70 shows three frequency distributions,
with β₂ equal, respectively, to 2, 3, and 4; the standard
deviations of the three curves are equal. It should be noted
that the smaller area at the peak of the flat-topped distribution
is accompanied by a loss of area at the tails of the distribution.
This loss of area at the peak and the tails is compensated by the
fact that the curve is higher than the normal curve on each side
between the peak and the tails. In other words, it is flat both
at the peak and at the tails and is high in the shoulders. In
contrast, the more peaked frequency curve is higher than the
normal curve at both the peak and the tails and lower than the
normal curve on each side between the peak and the tails. Its
shoulders are lower than those of the normal curve.

Table 10. Computation of Three Frequency Curves¹
(β₂ = 2, β₂ = 3, β₂ = 4)

   x/σ       φ₀         φ₄       (1/24)φ₄   -(1/24)φ₄    β₂ = 4     β₂ = 2
          (β₂ = 3)
   (1)       (2)        (3)        (4)         (5)         (6)        (7)
   0.0     0.3989     1.1968     0.0499     -0.0499      0.4488     0.3490
   0.2     0.3910     1.0799     0.0450     -0.0450      0.4360     0.3460
   0.4     0.3683     0.7607     0.0317     -0.0317      0.4000     0.3366
   0.6     0.3332     0.3231     0.0135     -0.0135      0.3467     0.3197
   0.8     0.2897    -0.1247    -0.0052      0.0052      0.2845     0.2949
   1.0     0.2420    -0.4839    -0.0202      0.0202      0.2218     0.2622
   1.2     0.1942    -0.6925    -0.0288      0.0288      0.1654     0.2230
   1.4     0.1497    -0.7364    -0.0307      0.0307      0.1190     0.1804
   1.6     0.1109    -0.6440    -0.0268      0.0268      0.0841     0.1377
   1.8     0.0790    -0.4692    -0.0195      0.0195      0.0595     0.0985
   2.0     0.0540    -0.2700    -0.0112      0.0112      0.0428     0.0652
   2.2     0.0355    -0.0927    -0.0039      0.0039      0.0316     0.0394
   2.4     0.0224     0.0362     0.0015     -0.0015      0.0239     0.0209
   2.6     0.0136     0.1105     0.0046     -0.0046      0.0182     0.0090
   2.8     0.0079     0.1379     0.0057     -0.0057      0.0136     0.0022

¹ Columns (2), (6), and (7) give the ordinates of the three curves.
Column (6) = (2) + (4), and column (7) = (2) + (5), account being taken
of the signs. The figures are derived from the formula for a
Gram-Charlier frequency curve. See Smith and Duncan, Sampling
Statistics.
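The ordinates in Table 10 can be approximated with a few lines of Python. The sketch rests on an assumption inferred from the table's column arithmetic rather than stated in the text: that each curve is the normal ordinate plus (β₂ - 3)/24 times the fourth derivative of the normal ordinate, the latter being (t⁴ - 6t² + 3) times the normal ordinate, where t = x/σ. The function names are made up for the illustration.

```python
import math


def phi0(t):
    """Ordinate of the normal curve in standard units (sigma = 1)."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)


def phi4(t):
    """Fourth derivative of the normal ordinate."""
    return (t ** 4 - 6 * t ** 2 + 3) * phi0(t)


def kurtosis_curve(t, beta2):
    """Symmetrical curve with the given beta_2 (assumed Gram-Charlier form)."""
    return phi0(t) + (beta2 - 3) / 24 * phi4(t)


for t in (0.0, 1.0, 2.0, 2.8):
    print(t, round(phi0(t), 4), round(kurtosis_curve(t, 4), 4),
          round(kurtosis_curve(t, 2), 4))
```

The printed ordinates agree with columns (2), (6), and (7) of the table to within a unit or so in the fourth decimal place; the small differences arise because the table adds already rounded figures.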

SUMMARY 

There are two ways in which frequency distributions differ
from the so-called "normal" frequency distribution.¹ (1) Frequency
distributions may have a higher or lower peak than the
normal frequency distribution. This relative flatness or lack of
flatness of the peak relative to the normal curve is called
"kurtosis." (2) Frequency distributions may have a preponderance of
variability to the large values or to the small values. This lack
of symmetry in variability is called "skewness." The normal
distribution and the concepts connected with its analysis constitute
a convenient point of departure for the general analysis
of variability. In this study of variability, the characteristics of
kurtosis and skewness are of great importance for the reason that
a large part of the phenomena studied have characteristics producing
frequency distributions that are not "normal." Even as
early as the time when Sir Francis Galton was developing his
theory of correlation (1877-1889), writers on mathematical
statistics realized that the univariate normal law of De Moivre
and Laplace could not be regarded as a universal law of frequency
distribution; the presence of skewness in homogeneous
material was certainly as common as that of normality.²

It is important to realize that the function of frequency-distribution
analysis is not primarily to define and measure
averages but to define, describe, and measure variability.
Simple averages have relatively limited uses and may lead to
misinterpretation rather than clarification if used without reference
to the measures of variability, skewness, and kurtosis.

¹ For further description of the normal curve, see Chaps. X and XI.

² Pretorius, S. J., Biometrika, Vol. 22 (1930-1931), pp. 109-223. Cf.
Elderton, W. Palin, Frequency Curves and Correlation (1927). Rietz,
H. L., Handbook of Mathematical Statistics, Chap. VII, Frequency Curves,
pp. 92-119, by H. C. Carver, and pp. 28^239 by W. A. Shewhart.

In this chapter some of the most generally used methods of
analysis of frequency distributions have been presented in
elementary form, in order to keep clear of the complications of
practical application. It will now be easier to see how these
methods are applied and just what complications enter into their
application to real problems. The following chapter gives an
example of such an analysis, using data from real life. But it will
be well to close this chapter with a summary of the symbols that
have thus far been used and that include by far the majority of
all the symbols used in statistics. Most of the symbolic language
can be learned from this chapter. If these symbols are mastered, the
few additional ones will be easy to learn.

Summary of Symbols

  A variable                      X₁, X₂, X₃, . . . , Xₙ or in general X
  Frequencies                     F₁, F₂, F₃, . . . , Fₙ or in general F
  Sum of                          Σ (Greek capital sigma)
  Number of cases                 N (equals ΣF)
  Arithmetic mean                 X̄
  Deviations from the mean        x₁, x₂, x₃, . . . , xₙ or in general
                                  x (equals X - X̄)
  Median                          Mi
  First quartile                  Q₁
  Third quartile                  Q₃
  Mode                            Mo
  Deviations from the median      x'₁, x'₂, x'₃, . . . , x'ₙ or in general
                                  x' (equals X - Mi)
  Geometric mean                  G.M.
  Harmonic mean                   H.M.
  Average deviation               A.D.
  Standard deviation              σ (Greek small sigma)
  Chi square                      χ² (Greek small chi)
  Skewness                        sk
  Moments:
    a. About arbitrary origin     ν₁, ν₂, . . . , νₙ
    b. About the arithmetic mean  μ₁, μ₂, μ₃, . . . , μₙ
  Kurtosis                        β₂ (Greek small beta)

Following is a summary of the formulas that have been used in
this chapter:

Summary of Formulas


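(1) $\bar{X} = \frac{\sum X}{N}$, also $\bar{X} = \frac{\sum FX}{N}$

(2) $\sum Fx = 0$

(3) $Mo = \bar{X} - 3(\bar{X} - Mi)$

(4) $\log \mathrm{G.M.} = \frac{\sum F \log X}{N}$

(5) $\frac{X_1}{\mathrm{G.M.}} \times \frac{X_2}{\mathrm{G.M.}} \times \cdots \times \frac{X_N}{\mathrm{G.M.}} = 1$

(6) $\mathrm{H.M.} = \frac{N}{\sum \frac{F}{X}}$

(7) $\mathrm{H.M.} < \mathrm{G.M.} < \bar{X}$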

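(8) $\nu_1 = \frac{\sum FX}{N}, \quad \nu_2 = \frac{\sum FX^2}{N}, \quad \ldots, \quad \nu_n = \frac{\sum FX^n}{N}$

(9) $\mu_1 = \frac{\sum Fx}{N}, \quad \mu_2 = \frac{\sum Fx^2}{N}, \quad \ldots, \quad \mu_n = \frac{\sum Fx^n}{N}$

(10) $\mu_0 = 1, \quad \mu_1 = 0, \quad \mu_2 = \sigma^2$

(11) $\mu_3 = 0, \quad \mu_5 = 0, \quad \mu_7 = 0, \ldots$ in a symmetrical distribution

(12) $\mathrm{A.D.}_{\bar{X}} = \frac{\sum |Fx|}{N}, \quad \mathrm{A.D.}_{Mi} = \frac{\sum |Fx'|}{N}$

(13) $\sigma = \sqrt{\frac{\sum Fx^2}{N}} = \mu_2^{1/2}$

(14) $sk = \frac{\bar{X} - Mo}{\sigma}$, also $sk = \frac{3(\bar{X} - Mi)}{\sigma}$

(15) $sk = \frac{Q_3 + Q_1 - 2Mi}{Q}$

(16) $sk = \mu_3^{1/3}$

(17) $sk = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$

(18) $Y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X - \bar{X})^2}{2\sigma^2}}$

ILLUSTRATION OF FREQUENCY-DISTRIBUTION ANALYSIS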

Data for a Frequency Distribution. Data selected to illustrate
frequency-distribution analysis are presented in Table 11.
Heights in inches of 300 members of the freshman class of 1943
were obtained from the records of the Department of Health and
Physical Education, Princeton University. As presented in the
table the data are not arranged in a frequency distribution;
they are listed at random. In order to make a frequency
distribution of the data it is necessary first to decide on the size and
limits of the class interval to be used in the construction of the
distribution; for the frequency distribution per se is a method of
summarization compared with the manner of presentation of the
data in Table 11.

CONSTRUCTION OF A FREQUENCY DISTRIBUTION 

The Class Interval. Rules for Determining the Class Interval.
The class interval is the unit of the frequency distribution; in
other words, it is the size of the groups in which the data are
summarized. In the data selected for illustration should the
groups be 1-inch, ½-inch, ¼-inch, or 1/10-inch groups? That is,
should the class interval be 1 inch, a half inch, a quarter inch, or a
tenth of an inch?

A general rule for selecting the class interval is that it should be 
such as to make possible, without serious error, the treatment of 
all values assigned to any one of the classes as if they were equal 
to the mid-point or mid-value of the class. The lower limit of the 
class intervals also should be so selected as to facilitate this end. 
If the cases are concentrated, in fact, at the mid-point of the class
interval or are evenly distributed throughout, it may, without
serious errors in calculation, be assumed that all cases in the class
are equal in value to the mid-value.

Another guide in the selection of the class interval is that it
should be as large as possible subject to the first condition and
to the condition that the interval should not be so large as to
conceal too much of the character of the variability. Indeed, the
most important purpose of the class interval is so to summarize
the data in a frequency distribution as to disclose more clearly
the character of the variability. If a very small class interval is
chosen, the character of the variability will not be visible unless a
very large number of cases are measured; if a very large class
interval is chosen, significant irregularities in the data may be
concealed.

Table 11. Heights, 300 Eighteen-year-old Members of the Class of
1943, Princeton University
(In inches)

Ordinarily, the size of the class interval should be uniform
throughout, because different-sized class intervals will complicate
calculations. In some cases, however, it is necessary to
use different-sized class intervals in order to give a proper picture
of variability.

If other more important rules are not thereby violated, in the 
interests of simplicity the position of the class interval in the 
range should be such that the limits of the intervals are integers 
or such that the mid-values of the class intervals are integers. 
Where marked concentration about certain values exists, as is 
sometimes the case in dealing with discrete data, these values 
should so far as possible be made the mid-points of class intervals. 
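The rules above amount to choosing a lower limit and a width and then tallying cases into classes. A minimal Python sketch follows; the function name `frequency_distribution` and the handful of heights are hypothetical (they are not taken from Table 11), and each class is taken to include its lower limit and exclude its upper limit.

```python
import math
from collections import Counter


def frequency_distribution(data, lower_limit, width):
    """Tally data into classes; return rows of (lower limit, mid-value, F)."""
    if min(data) < lower_limit:
        raise ValueError("lower_limit must not exceed the smallest value")
    # Class k covers lower_limit + k*width up to, but not including, the next limit.
    counts = Counter(math.floor((x - lower_limit) / width) for x in data)
    rows = []
    for k in range(max(counts) + 1):
        lo = lower_limit + k * width
        rows.append((lo, lo + width / 2, counts.get(k, 0)))
    return rows


# Hypothetical heights in inches, for illustration only.
heights = [66.5, 67.0, 67.25, 68.5, 68.5, 69.0, 70.75, 71.0]
for lower, mid, f in frequency_distribution(heights, 66.0, 1.0):
    print(f"{lower:g}-   {mid:g}   {f}")
```

Choosing integral lower limits here places the mid-values at the half-inch points; a different `lower_limit` would shift the mid-values, which is the consideration the last rule addresses.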

An Array of the Data. Intelligent determination of the class
interval is aided by study of the data arranged in an array or
scatter diagram such as Fig. 71, which is presented to illustrate
the determination of the proper class interval. In the figure,
the heights shown in Table 11 are arranged in an array. Because
inspection of the data in Table 11 led to the suspicion that
concentration points were present at the ¼-, ½-, and ¾-inch values,
the array is presented in rows with these concentration points
plumbed. Summing the columns, as well as inspection of the
detail of the scatter diagram, shows the concentration of
frequencies at these values.

Frequency Distribution with Too Many Class Intervals. Examination
of Fig. 71 suggests that a ¼-inch class interval beginning
at 61.875 inches, as shown in Table 12, might be a good class
interval for the data of Table 11, for the ¼-inch class interval
with the lower limits as shown in Table 12 places the mid-values
of class intervals at points of concentration. Such a frequency
distribution contains over 60 rows, however, and, in addition,
is uneven and irregular in appearance. Ten frequencies occur
in the interval 66.875-; only five frequencies occur in the interval
68.625-; twelve frequencies occur in the intervals immediately
below and above 68.625-. Moreover, it is not clear whether
the modal class interval is 69½, 70½, 70¾, 71, or 72 inches, because
an equal number (15) have each of these five heights.

The ¼-inch class interval is too small in this instance to
disclose clearly the nature of variation in freshman heights.

Table 12. Frequency Distribution of the Heights of 300 Princeton
Freshmen, Class of 1943

      Heights of freshmen, X        Number of freshmen having
                                        specified heights
    Interval       Mid-value                   F
    61.875-          62.00                     1
    62.125-          62.25                     0
    62.375-          62.50                     0
    62.625-          62.75                     0
    62.875-          63.00                     1
    63.125-          63.25                     0
    63.375-          63.50                     1
    63.625-          63.75                     0
    63.875-          64.00                     0
    64.125-          64.25                     1
    64.375-          64.50                     0
    64.625-          64.75                     0
    64.875-          65.00                     1
    65.125-          65.25                     0
    65.375-          65.50                     1
    65.625-          65.75                     2
    65.875-          66.00                     1
    66.125-          66.25                     1
    66.375-          66.50                     6
    66.625-          66.75                     3
    66.875-          67.00                    10
    67.125-          67.25                     4
    67.375-          67.50                    12
    67.625-          67.75                     5
    67.875-          68.00                     9
    68.125-          68.25                     6
    68.375-          68.50                    12
    68.625-          68.75                     5
    68.875-          69.00                    12
    69.125-          69.25                     9
    69.375-          69.50                    15
    69.625-          69.75                    11
    69.875-          70.00                    12
    70.125-          70.25                     6
    70.375-          70.50                    15
    70.625-          70.75                    15
    70.875-          71.00                    15
    71.125-          71.25                    10
    71.375-          71.50                     9
    71.625-          71.75                     8
    71.875-          72.00                    15
    72.125-          72.25                     2
    72.375-          72.50                    12
    72.625-          72.75                     6
    72.875-          73.00                     7
    73.125-          73.25                     5
    73.375-          73.50                     5
    73.625-          73.75                     4
    73.875-          74.00                     6
    74.125-          74.25                     3
    74.375-          74.50                     3
    74.625-          74.75                     2
    74.875-          75.00                     1
    75.125-          75.25                     3
    75.375-          75.50                     1
    75.625-          75.75                     3
    75.875-          76.00                     0
    76.125-          76.25                     1
    76.375-          76.50                     1
    76.625-          76.75                     0
    76.875-          77.00                     0
    77.125-          77.25                     1

    Total                                    300

A Larger Class Interval Reveals the Character of Variation. If
1 inch is taken as the class interval, the frequency distribution
will contain 17 classes and will appear as shown in Table 13.
In this frequency distribution the lower limits of the class
intervals are so chosen that mid-values are at the 0.625-inch points
(⅝ inch), which is a balancing center of the concentration points
at the ¼-inch intervals because at ⅝ inch each mid-value has two
¼-inch concentration points below it and two above it in the
1-inch class interval. This balancing position of the ⅝-inch
points can be readily seen by an examination of Fig. 71.

Fig. 72. Distribution of heights of 300 Princeton freshmen. (Class interval
= ¼ inch.)

In order to contrast the irregularities in the frequency
distribution using too small a class interval with the regular
appearance of the same frequency distribution using a larger class
interval, Figs. 72 and 73 are presented. Figure 72 is a graph
of the frequency distribution of heights of 300 Princeton freshmen,
using a ¼-inch class interval. Figure 73 is a graph of the
frequency distribution of heights of 300 Princeton freshmen,
using a 1-inch class interval.

The argument for a class interval centered at the ⅝-inch point
has been based on the assumption that measurements have been
made to the nearest ¼ inch. In other words, a height recorded
as 64.25 might be anything between 64.125 and 64.375. If
measurements were always made to the lowest ¼ inch, then some
other mid-point would be warranted such as the ½-inch points,
or integral values. Table 14 is one based on this assumption.
Since the exact method of measurement is not known and since
Table 14 is simplest in form, it is adopted for subsequent analysis.
A graph of the distribution has already been shown in Fig. 71.

In frequency Tables 12 to 14, the class interval has been listed 
in two ways. (1) It has been described by writing on each line 
the lower limit of the class interval, followed by a dash. (2) It 

Table 13. Frequency Distribution of the Heights of 300 Princeton
Freshmen, Class of 1943

    Heights of freshmen, X          Number of freshmen having
    Interval        Mid-value       specified heights, F

    61.125-         61.625            1
    62.125-         62.625            1
    63.125-         63.625            1
    64.125-         64.625            2
    65.125-         65.625            4
    66.125-         66.625           20
    67.125-         67.625           30
    68.125-         68.625           35
    69.125-         69.625           47
    70.125-         70.625           51
    71.125-         71.625           42
    72.125-         72.625           27
    73.125-         73.625           20
    74.125-         74.625            9
    75.125-         75.625            7
    76.125-         76.625            2
    77.125-         77.625            1

    Total                           300

Table 14. Frequency Distribution of the Heights of 300 Princeton
Freshmen, Class of 1943

    Heights of freshmen, X          Number of freshmen having
    Interval        Mid-value       specified heights, F

    62-             62.5              1
    63-             63.5              2
    64-             64.5              1
    65-             65.5              4
    66-             66.5             12
    67-             67.5             31
    68-             68.5             31
    69-             69.5             47
    70-             70.5             48
    71-             71.5             42
    72-             72.5             35
    73-             73.5             21
    74-             74.5             14
    75-             75.5              8
    76-             76.5              2
    77-             77.5              1

has been described by writing in the next column the mid-value.
Obviously, both methods of describing the class interval need 
not always be employed; the conventional procedure is to use 
the lower-limit description rather than the mid-value descrip- 
tion. The mid-value can always be calculated by adding one- 
half the class interval to the lower limit of the class interval. 
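
The rule just stated can be sketched in a few lines of Python. The function name `mid_values` is only an illustrative choice, not something taken from the text; the lower limits are those of Tables 13 and 14.

```python
def mid_values(lower_limits, i):
    """Mid-value of each class interval: lower limit plus half the interval i."""
    return [lower + i / 2 for lower in lower_limits]

# Table 14 style: integral lower limits, 1-inch class interval.
table_14_mids = mid_values([62, 63, 64], i=1)

# Table 13 style: lower limits at the 1/8-inch points, 1-inch class interval.
table_13_mids = mid_values([61.125, 62.125], i=1)

print(table_14_mids)   # [62.5, 63.5, 64.5]
print(table_13_mids)   # [61.625, 62.625]
```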

WORK SHEET FOR FREQUENCY-DISTRIBUTION ANALYSIS

The frequency distribution having been constructed, the
procedure for frequency-distribution analysis will now be
described. Table 15 is a work sheet for the analysis of a
frequency distribution; in columns (1) and (2), under X and F,
is copied the frequency distribution from Table 14. Entries
in the remaining columns will be explained below. The work
sheet is so constructed that advantage may be taken of certain
economies in calculation. These economies arise from two
sources: (1) the reduction in calculation, due to the use of a short
method that involves the calculation of the moments about an
arbitrary origin and (2) a reduction in calculation, due to
the use of class intervals as units of deviation from the arbitrary
origin.

Saving Calculation by Obtaining Moments about an Arbitrary
Origin. In applying the short method, an arbitrary origin,
which may be called A, is selected. While zero may be taken
as an arbitrary origin (and often is in certain statistical problems),
in the analysis of frequency distributions the amount of
calculation is reduced by selecting a value for A somewhere near the
middle of the range. The moments about the arbitrary origin
are then calculated by measuring deviations from A in
class-interval units, that is, in d/i's. Sometimes d' is used to symbolize
d/i. The savings in calculation are due to the fact that all desired
mathematical statistics can then be computed by the use of
formulas from the four moments about the arbitrary origin.

Saving Calculation by Using Class-interval Units. Saving in
the amount of calculation to obtain the various statistics results
if the class-interval unit is used, particularly if the variable is in
complex or fractional units or in large numbers. This economy
is brought about by expressing the deviations in terms of class
intervals instead of in original units, i.e., in d/i's instead of in
d's. As pointed out above, this saving is augmented by selecting
the arbitrary origin near the middle of the frequency distribution.
If the arbitrary origin is at or near the middle class interval, the
largest deviation in terms of class-interval units will then be no
greater than half the number of class intervals in the frequency
distribution. Since the deviations must be raised to the fourth
power in order to calculate the fourth moment, substantial saving
in calculation is secured by keeping class-interval deviations as
small as possible by placing the arbitrary origin near the middle
of the frequency distribution. It will be observed in Table 15
that the frequency distribution has been copied on the work
sheet in such a position that the arbitrary origin is near the middle
of the frequency distribution. It can also be seen that, when
the class interval is uniform in size, recording the class-interval
deviations in column (3) is merely a matter of proceeding by
count above and below the arbitrary origin, that is, -1, -2,
-3, etc., for successive smaller class-interval values, and 1, 2, 3,
etc., for successive larger class-interval values.

Entering the Frequency Distribution on the Work Sheet. The
frequency distribution of freshmen heights shown in Table 14
has 16 class intervals; and if the mid-value of the central class
interval is to be selected as the arbitrary origin, the first class
interval, 62-, will be entered in column (1) under "Interval"
on the line opposite -7 in column (3) (d/i = -7). The remaining
class intervals will be entered in succeeding lines until 77-
will be opposite 8 in column (3) (d/i = 8). The mid-value
of the central class interval is 69.5, which is opposite 0 in column
(3) (d/i = 0). The corresponding frequencies are then entered
in column (2). Full description of the data and their source
is entered in the space provided at the top of the work sheet.

Saving Calculation in Use of Work Sheet. The amount of
calculation involved in the entries required for columns (4) to
(9) can be reduced to a minimum by the following procedure:

In column (4), headed $F(d/i)$, enter the class-interval
deviations multiplied by the frequencies [i.e., items in column (3)
multiplied, respectively, by items in column (2)]. The algebraic
sum of the figures in column (4), divided by N, equals the
first moment (in class-interval units) about the arbitrary
origin.

The figures in column (5), headed $F(d/i)^2$, are obtained by
multiplying the items in column (4), respectively, by the
corresponding items in column (3). The sum of figures in column
(5), divided by N, equals the second moment (in class-interval
units) about the arbitrary origin.

The figures in column (6), headed $F(d/i)^3$, are most easily
obtained by multiplying the items in column (5), respectively,
by the corresponding items in column (3). The algebraic sum
of figures in column (6), divided by N, equals the third moment
(in class-interval units) about the arbitrary origin.

The figures in column (7), headed $F(d/i)^4$, are obtained by
multiplying the items in column (6), respectively, by the
corresponding items in column (3). The sum of figures in column
(7), divided by N, equals the fourth moment (in class-interval
units) about the arbitrary origin.

The figures in column (8), headed $(d/i + 1)^4$, are obtained
by adding 1, respectively, to each figure in column (3) and raising
the result to its fourth power. All figures in this column are
readily obtained by using a table of powers of numbers.* The
sum of column (8) is not used.

The figures in column (9), headed $F(d/i + 1)^4$, are obtained
by multiplying the items in column (8), respectively, by
corresponding items in column (2). The sum of column (9) is used
to check the arithmetical accuracy of all calculations in the
work sheet.

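When the work sheet is completed, it will show the following
values:

$$A,\quad i,\quad N,\quad \sum F\left(\frac{d}{i}\right),\quad \sum F\left(\frac{d}{i}\right)^2,\quad \sum F\left(\frac{d}{i}\right)^3,\quad \text{and}\quad \sum F\left(\frac{d}{i}\right)^4$$

In addition, by means of columns (8) and (9), the work sheet
provides a cross check on its internal calculations, since the
expansion of $\sum F\left(\frac{d}{i} + 1\right)^4$ gives the following terms:

$$\sum F\left(\frac{d}{i}\right)^4 + 4\sum F\left(\frac{d}{i}\right)^3 + 6\sum F\left(\frac{d}{i}\right)^2 + 4\sum F\left(\frac{d}{i}\right) + \sum F$$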


It follows that on a correctly constructed work sheet the sum
of column (9) equals the sum of column (7) plus four times the
sum of column (6) plus six times the sum of column (5) plus four
times the sum of column (4) plus the sum of column (2). This
is called a "Charlier check," after the name of the man who first
suggested its use as a checking device.

For Table 15 the Charlier check is as follows:

     Σ[column (2)]                    =    300
    4Σ[column (4)] = 4 × 292          =  1,168
    6Σ[column (5)] = 6 × 2,140        = 12,840
    4Σ[column (6)] = 4 × 5,590        = 22,360
     Σ[column (7)]                    = 45,088

    Sum = Σ[column (9)]               = 81,756
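
The column sums and the Charlier check can be reproduced with a short Python sketch. The frequencies are those of Table 14 and the deviations run from -7 to 8 as described above; the helper name `column_sum` is an illustrative assumption, not part of the work sheet.

```python
# Frequencies F for the class intervals 62-, 63-, ..., 77- (Table 14).
F = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]
# Class-interval deviations d/i from the arbitrary origin A = 69.5.
D = list(range(-7, 9))

def column_sum(power, shift=0):
    """Sum of F * (d/i + shift)**power over all class intervals."""
    return sum(f * (d + shift) ** power for f, d in zip(F, D))

N = sum(F)                                               # column (2)
s4, s5, s6, s7 = (column_sum(p) for p in (1, 2, 3, 4))   # columns (4) to (7)
s9 = column_sum(4, shift=1)                              # column (9)

# Charlier check: binomial expansion of sum F (d/i + 1)^4.
charlier = s7 + 4 * s6 + 6 * s5 + 4 * s4 + N

print(N, s4, s5, s6, s7)     # 300 292 2140 5590 45088
print(s9, charlier == s9)    # 81756 True
```

Because every quantity is an integer, the check is exact; any slip in columns (4) to (7) would make the two totals disagree.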


* Cf. Mathematical Tables from Handbook of Chemistry and Physics, pp.
153-173. For use in making calculations there are a number of convenient
devices such as the slide rule and calculating machines, as well as logarithms.
There are also several useful printed tables such as Barlow's Tables of
Squares, Cubes, Square-roots, Cube-roots, and Reciprocals of Integers up to
10,000 and Karl Pearson's Tables for Statisticians and Biometricians. A set
of logarithms will be found in Appendix, Table I.
Table 15. Work Sheet for Making Calculations in the Analysis
of a Frequency Distribution

Description of Data: Heights of 300 Princeton University Freshmen,
Class of 1943

Source of Data: Princeton University's Department of Health and
Physical Education

i = 1 in.



Columns (8) and (9) are for checking purposes. Σ[Column (9)] = Σ[Column (2)] +
4Σ[Column (4)] + 6Σ[Column (5)] + 4Σ[Column (6)] + Σ[Column (7)].

Moments about the Arbitrary Origin. The moments about an
arbitrary origin can be quickly calculated from the sums of
columns (4) to (7), because by definition the moments about an
arbitrary origin are as follows:

$$\nu_1 = \frac{\sum Fd}{N}, \qquad \nu_2 = \frac{\sum Fd^2}{N}, \qquad \nu_3 = \frac{\sum Fd^3}{N}, \qquad \nu_4 = \frac{\sum Fd^4}{N}$$

where X - A = d.

If A were zero, d would equal X; and the moments would then
reduce to the form shown in Chap. VI.

When, as in Table 15, the deviations have been taken in
class-interval units rather than in original units, the formulas for the
moments about an arbitrary origin would be written as follows:¹

$$\nu_1' = \frac{\sum F(d/i)}{N}, \qquad \nu_2' = \frac{\sum F(d/i)^2}{N}, \qquad \nu_3' = \frac{\sum F(d/i)^3}{N}, \qquad \nu_4' = \frac{\sum F(d/i)^4}{N} \qquad (1)$$

where $X - A = \frac{d}{i}(i)$, in which i is the class interval.

¹ Cf. p. 181. The prime on ν means that the ν is in class-interval units;
i.e., $\nu_1' = \nu_1/i$, $\nu_2' = \nu_2/i^2$, etc.

Accordingly, the moments in class-interval units about an
arbitrary origin are obtained from the sums of columns (4) to (7)
of the work sheet by dividing each by N [the sum of column (2)].

In Table 15, the moments about the arbitrary origin in
class-interval units are as follows:

$$\nu_1' = \frac{292}{300} = 0.97333$$

$$\nu_2' = \frac{2,140}{300} = 7.13333$$

$$\nu_3' = \frac{5,590}{300} = 18.63333$$

$$\nu_4' = \frac{45,088}{300} = 150.29333$$


Moments about the Arithmetic Mean. When the moments
about an arbitrary origin are obtained, the moments about the
mean are obtained from the following equations:¹

$$\mu_1' = \nu_1' - \nu_1' = 0$$
$$\mu_2' = \nu_2' - \nu_1'^2$$
$$\mu_3' = \nu_3' - 3\nu_2'\nu_1' + 2\nu_1'^3$$
$$\mu_4' = \nu_4' - 4\nu_3'\nu_1' + 6\nu_2'\nu_1'^2 - 3\nu_1'^4 \qquad (2)$$



The moments about the arbitrary origin having been
calculated for the frequency distribution of freshmen heights in
Table 15, the moments about the arithmetic mean in
class-interval units may now be obtained by the use of Eqs. (2), as
follows:

$$\mu_1' = 0.97333 - 0.97333 = 0$$

$$\mu_2' = 7.13333 - (0.97333)^2 = 7.13333 - 0.94737 = 6.18596$$

$$\mu_3' = 18.63333 - 3(7.13333)(0.97333) + 2(0.97333)^3 = 18.63333 - 20.82924 + 1.84420 = -0.35171$$

$$\mu_4' = 150.29333 - 4(18.63333)(0.97333) + 6(7.13333)(0.97333)^2 - 3(0.97333)^4$$
$$\quad = 150.29333 - 72.54552 + 40.54740 - 2.69253 = 115.60268$$
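
As a rough check on this arithmetic, Eqs. (2) can be evaluated directly from the column sums of the work sheet. The sketch below uses unrounded quotients, so its results differ from the printed figures in the fourth or fifth decimal place; the variable names are illustrative only.

```python
N = 300
# Sums of columns (4) to (7) of the work sheet.
column_sums = {1: 292, 2: 2140, 3: 5590, 4: 45088}

# Moments about the arbitrary origin, in class-interval units.
v1, v2, v3, v4 = (column_sums[k] / N for k in (1, 2, 3, 4))

# Moments about the mean, Eqs. (2).
m2 = v2 - v1**2
m3 = v3 - 3 * v2 * v1 + 2 * v1**3
m4 = v4 - 4 * v3 * v1 + 6 * v2 * v1**2 - 3 * v1**4

print(round(m2, 4), round(m3, 4), round(m4, 4))
```

The small disagreement with the text (for example in the third moment) comes only from the text rounding each intermediate product to five decimals.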

Equations (2) for finding the moments about the mean from 
the moments about an arbitrary origin may be proved as follows: 


¹ $\mu_1' = \mu_1/i$, $\mu_2' = \mu_2/i^2$, $\mu_3' = \mu_3/i^3$, etc.
Since, in Eqs. (1) for moments about an arbitrary origin,
i(d/i) = X - A, it follows that

$$F_1 d_1 = F_1(X_1 - A) = F_1 X_1 - F_1 A$$
$$F_2 d_2 = F_2(X_2 - A) = F_2 X_2 - F_2 A$$
$$\cdots$$
$$F_n d_n = F_n(X_n - A) = F_n X_n - F_n A$$

By adding,

$$\sum Fd = \sum FX - NA \qquad (a)$$

since $\sum F = N$.

Because A is a constant, $d_1, d_2, \ldots, d_n$ will vary in
proportion as $X_1, X_2, \ldots, X_n$ vary. Also, since A is a constant, the
sum of the A's may be written as the constant multiplied by the
total frequencies, or NA. If now Eq. (a) is divided by N,

$$\frac{\sum Fd}{N} = \frac{\sum FX}{N} - A \qquad (b)$$

But, by definition,

$$\frac{\sum Fd}{N} = \frac{\sum F(d/i)}{N}(i) = \nu_1'(i) = \nu_1$$

and

$$\frac{\sum FX}{N} = \bar{X}, \text{ the arithmetic mean}$$

Therefore, by substitution and transposing, Eq. (b) becomes

$$\bar{X} = A + \nu_1'(i) \quad \text{or} \quad \bar{X} = A + \nu_1 \qquad (3)$$

Accordingly the arithmetic mean of the frequency distribution
of 300 freshmen heights shown in Table 15 is as follows:¹

$$\bar{X} = 69.5 + 0.97333, \text{ since } i = 1$$
$$\quad = 70.47 \text{ in.}$$

¹ The result of calculation is 70.47333; but since the beginning data were
significant to only two places beyond the decimal, the figures beyond .47
are not significant. The manner in which the figures are written in Table
11, which was taken from the source of the data, indicates accuracy to two
decimal places. Had the numbers been rounded off to the nearest inch,
the calculated mean would have significant figures to the nearest inch.
Nevertheless, if the value of the mean is to be used for making further
mathematical calculations to obtain other statistics, it should be carried out to
several more decimal places in order to give an accurate result to two places
in the additional statistics.

It has thus been proved that the arithmetic mean equals any
arbitrary quantity plus the first moment about that arbitrary
quantity. In other words, the arithmetic mean of a series of
magnitudes is equal to any arbitrary quantity plus the mean of the
deviations from the arbitrary quantity. From Eq. (3) and from
the fact that d = X - A, it follows that A = X - d and that
$\bar{X} = X - d + \nu_1$. Therefore, $X - \bar{X} = d - \nu_1$, and

$$x = d - \nu_1 \qquad (c)$$

or $x/i = d/i - \nu_1'$ if d is in class-interval units.

This value for x may be substituted in the equations defining the
moments about the mean, as follows:²

² For definition of moments about the mean, cf. p. 181.
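
The substitution can be sketched as follows. This is a reconstruction from the definitions already given ($x = d - \nu_1$ and $\nu_k = \sum Fd^k / N$), offered as an aid to the reader rather than as the original typeset derivation; each line is simply a binomial expansion followed by division by N.

```latex
\begin{align*}
\mu_2 &= \frac{\sum F x^2}{N} = \frac{\sum F (d - \nu_1)^2}{N}
       = \nu_2 - 2\nu_1 \nu_1 + \nu_1^2
       = \nu_2 - \nu_1^2 \\
\mu_3 &= \frac{\sum F (d - \nu_1)^3}{N}
       = \nu_3 - 3\nu_2\nu_1 + 3\nu_1^2 \nu_1 - \nu_1^3
       = \nu_3 - 3\nu_2\nu_1 + 2\nu_1^3 \\
\mu_4 &= \frac{\sum F (d - \nu_1)^4}{N}
       = \nu_4 - 4\nu_3\nu_1 + 6\nu_2\nu_1^2 - 4\nu_1^3 \nu_1 + \nu_1^4
       = \nu_4 - 4\nu_3\nu_1 + 6\nu_2\nu_1^2 - 3\nu_1^4
\end{align*}
```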




$$\mu_1 = \nu_1 - \nu_1 = 0$$
$$\mu_2 = \nu_2 - \nu_1^2$$
$$\mu_3 = \nu_3 - 3\nu_2\nu_1 + 2\nu_1^3$$
$$\mu_4 = \nu_4 - 4\nu_3\nu_1 + 6\nu_2\nu_1^2 - 3\nu_1^4$$

which it was said at the beginning of this section would be proved.
An important corollary follows from the above derivation of the
second moment (or "variance," as it is sometimes called).
Since

$$\mu_2 = \nu_2 - \nu_1^2$$

it follows that the mean square deviation about the mean of the
observations is less than the mean square deviation about any
arbitrary quantity; that is, the mean square deviation ($\sigma^2$) about
the mean is a minimum, smaller than it would be if calculated
from any other average. This is obvious from the equation;
since $\nu_1^2$ is a positive quantity, being a square, $\mu_2$ must be less
than $\nu_2$.
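
The corollary can be illustrated numerically. The sketch below, which assumes the mid-values and frequencies of Table 14, computes the mean square deviation about several arbitrary origins and compares each with $\mu_2 + \nu_1^2$; the function name `mean_square_deviation` and the trial origins are illustrative choices.

```python
# Frequencies for the class intervals 62-, 63-, ..., 77- (Table 14).
F = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]
MID = [62.5 + k for k in range(len(F))]   # class mid-values, i = 1 in.
N = sum(F)

mean = sum(f * x for f, x in zip(F, MID)) / N

def mean_square_deviation(origin):
    """Second moment of the grouped data about the given origin."""
    return sum(f * (x - origin) ** 2 for f, x in zip(F, MID)) / N

mu2 = mean_square_deviation(mean)   # second moment about the mean

for a in (66.0, 69.5, mean, 72.0):
    nu1 = mean - a                  # first moment about the origin a
    nu2 = mean_square_deviation(a)  # second moment about the origin a
    print(f"A = {a:8.4f}   nu2 = {nu2:8.5f}   mu2 + nu1^2 = {mu2 + nu1**2:8.5f}")
```

In every row the two right-hand figures agree, and the second moment is smallest when the origin is the mean itself.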

The Standard Deviation. The standard deviation about the
arithmetic mean may now be quickly calculated, since it is by
definition the square root of the second moment. For the
frequency distribution of heights of 300 freshmen the standard
deviation is as follows:

$$\sigma' = \sqrt{\mu_2'} = 2.487 \text{ or } 2.49 \text{ in.}$$

Since the moments were calculated in class-interval units (see
page 212), this result is also in class-interval units. The standard
deviation in original units is found by multiplying by i. In the
present problem, i = 1; hence, $\sigma = \sigma' i$ = 2.49 in.

The Beta Coefficients. For the frequency distribution of heights
of 300 freshmen, the first two beta coefficients are as follows:

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = 0.00052$$

$$\beta_2 = \frac{\mu_4}{\mu_2^2} = 3.02102$$

Since the betas are ratios having i raised to the same power
in both numerator and denominator, the fact that the moments
are in class-interval units instead of original units may be
disregarded.

Measures of Skewness and Kurtosis. Measures of skewness
and kurtosis are also readily determined from the moments about
the mean. In the frequency distribution of heights of 300
freshmen, the measure of kurtosis, $\beta_2$, calculated above, is 3.021,
slightly larger than 3. Hence the frequency distribution is
somewhat less flat-topped than the normal curve.*

Skewness in heights of the 300 freshmen, measured by the
cube root of the third moment, is -0.7057 class intervals.
Since i = 1 in., this is -0.7057 inch.
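
These statistics follow mechanically from the moments about the mean, as the Python sketch below suggests. It starts from the moments as printed in the text, so its results match the printed figures only to about three or four decimals; the variable names are illustrative.

```python
import math

# Moments about the mean in class-interval units, as printed in the text.
m2, m3, m4 = 6.18596, -0.35171, 115.60268

sigma = math.sqrt(m2)          # standard deviation
beta1 = m3**2 / m2**3          # first beta coefficient
beta2 = m4 / m2**2             # second beta coefficient (kurtosis)

# Signed cube root of the third moment; abs() avoids taking a fractional
# power of a negative float.
cube_root_m3 = math.copysign(abs(m3) ** (1 / 3), m3)

print(round(sigma, 3), round(beta1, 5), round(beta2, 3), round(cube_root_m3, 3))
```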

CALCULATION OF OTHER STATISTICS 

Averages and Measures of Variability. Difficulties in Locating
the Median and the Mode. Consideration of the median, the
mode, and the quartiles has been left to the last for the reason
that, in the analysis of frequency distributions with class
intervals, these values must be estimated. By definition, the median
is the value at the center of the distribution, the first quartile
is the value midway between the lower limit of the range and the
median, and the third quartile is the value midway between
the median and the upper limit of the range. The mode is the
value that occurs with the greatest frequency. The calculations
of these statistics are not based on the work sheet shown in
Table 15.

* Figure 101, p. 295, is a graph comparing the frequency distribution
with the ideal normal curve.

Because they are concealed in the class interval among a group 
of other cases in the same class interval, the quartiles, the median, 
and the mode must be obtained by estimation. Where within 
the range of the class interval is the median? Where within 
the range of the class interval with the largest frequency is the 
mode? These questions have to be answered by interpolation, 
and the value so obtained becomes an abstract quantity — as 
abstract and mathematical in character as the mean, but without 
the latter’s precision. 

The Mode. In the case of the mode, a further difficulty arises 
in finding the correct answer to the question: Which class 
interval should be considered to contain the mode? If different- 
sized class intervals are taken in each of several frequency dis- 
tributions of the same data, the modal class interval will be 
observed to shift around. The mode, by definition the simplest 
of the several measures, is actually the most difficult average 
to locate. Its accurate computation is more highly mathematical 
than that of the arithmetic mean. If a Pearsonian curve gives a 
good fit to the data, the ideal method of obtaining the mode is to 
find the mode of this curve. A formula for this is given on the 
next page. The disadvantage of this method is that there is no 
way of telling whether a curve is a good fit or not until it is 
actually fitted, and this involves a considerable amount of calcu- 
lation just for the sake of finding the mode. 

But simpler measures of the mode are often used. These 
are interpolated values, on the assumption that the mode lies 
in the modal class interval, that is, the class interval that has 
the highest frequency. It is assumed that the general shape 
of the distribution affects the distribution of cases at the point 
of greatest concentration in the following manner: All the fre- 
quencies below the modal class interval are pulling the mode near 
the lower limit of that class interval, and all the frequencies 
above the modal class interval are pulling the mode toward the 
upper limit of the interval. The mode is equal to the lower
limit of the modal class interval plus the interpolated part of
the class interval established by the relationship of the frequencies
above and below that class interval. In the frequency
distribution of freshmen heights (Table 16), the modal class interval,
that is, the class interval with the greatest concentration of cases,
is 70-. There are 129 cases pulling the mode toward the lower
limit of the class interval 70-, and 123 cases pulling the mode
toward the upper limit. Consequently,

$$Mo = 70 + \frac{123}{252} \times 1 = 70.488 \text{ or } 70.49 \text{ in.}$$

The so-called "mathematical mode," an approximation of the
mode of the Pearsonian curve that is invalid if the frequency
curve is very skewed, is calculated from the following equation:¹

$$Mo = \bar{X} - 3(\bar{X} - Mi)$$ *

For the frequency distribution of 300 freshmen heights, the
mathematical mode is calculated as follows:

$$Mo = 70.47333 - 3(70.47333 - 70.4375) = 70.366 \text{ or } 70.37 \text{ in.}$$

The mode of the Pearsonian curve fitted to the data is given
by the formula:

$$Mo = \bar{X} - \sigma\, sk$$

where

$$sk = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$$

For the frequency distribution of 300 freshmen heights, the
mode calculated by this equation is as follows:

$$Mo = 70.50 \text{ in.}$$
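
The three estimates of the mode can be compared side by side in a short Python sketch. The inputs are the statistics printed in the text; giving sk the sign of the third moment follows the note on coefficients of skewness later in the chapter, and the variable names are illustrative assumptions.

```python
import math

# Statistics as printed in the text (inches; i = 1).
mean, median, sigma = 70.47333, 70.4375, 2.48716
beta1, beta2, m3 = 0.00052, 3.02102, -0.35171

# (a) Interpolated mode: 129 cases lie below the modal interval 70-,
#     and 123 cases lie above it.
lower_limit, i = 70, 1
below, above = 129, 123
mode_interpolated = lower_limit + above / (below + above) * i

# (b) "Mathematical mode" from the mean and the median.
mode_mathematical = mean - 3 * (mean - median)

# (c) Mode of the Pearsonian curve; sk carries the sign of the third moment.
sk = math.sqrt(beta1) * (beta2 + 3) / (2 * (5 * beta2 - 6 * beta1 - 9))
sk = math.copysign(sk, m3)
mode_pearson = mean - sigma * sk

print(round(mode_interpolated, 3), round(mode_mathematical, 3), round(mode_pearson, 2))
```

The spread among the three values, roughly 70.37 to 70.50 in., gives a sense of how loosely the mode is determined for grouped data.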


The Median and the Quartiles. Determination of the median
and the quartiles by interpolation is reasonably accurate if, as
it is assumed, the cases are evenly distributed within the class
interval containing the median and the two quartiles,
respectively. The calculation of the median and the quartiles is
facilitated by making a column of cumulated frequencies as
shown in Table 16. The median is equal to the lower limit of
the class containing the N/2th case plus an interpolated amount
within the class interval determined by the ratio of the balance
of frequencies necessary to make up N/2 frequencies to the
frequencies in the interval. In the frequency distribution of
freshmen heights (Table 16), N/2 = 150. The frequencies
are counted cumulatively from the lower limit of the first class
interval (top of the table). By this count, there are 129 cases
to the lower limit of the class interval 70-. When the point
70 is reached on the quantity scale, 129 cases have been counted;
but the median is the value of the 150th case, that is, 21 cases
beyond 70. From 70 to 71 there are 48 cases. Consequently,
the ratio of interpolation within the class interval is 21/48.
Accordingly, the estimate of the median in freshmen heights is as
follows:

$$Mi = 70 + \frac{21}{48} \times 1 = 70.4375 \text{ or } 70.44 \text{ in.}$$

¹ Cf. p. 173.

* The median is 70.4375. Cf. the next section.

Table 16. Frequency Distribution of the Heights of 300 Princeton
Freshmen, Class of 1943
(In inches. Class interval 1 in.)

    X          F       Cumulative F

    62-        1             1
    63-        2             3
    64-        1             4
    65-        4             8
    66-       12            20
    67-       31            51
    68-       31            82
    69-       47           129
    70-       48           177
    71-       42           219
    72-       35           254
    73-       21           275
    74-       14           289
    75-        8           297
    76-        2           299
    77-        1           300

    Total    300

1. There are 129 cases to X = 70.

2. Since N/2 = 150.0, this leaves 150.0 - 129, or 21.0 cases to go, of the
48 cases in the next class interval (70-71).

3. The interpolated amount of the class-interval range is therefore
21/48 × 1.

The third and first quartiles are calculated by interpolating
in a similar manner for the values of the 3N/4th and the N/4th
cases. In the frequency distribution of the heights of 300
freshmen, following are the values of the quartiles:

$$Q_1 = 68 + \frac{24}{31} \times 1 = 68.774 \text{ or } 68.77 \text{ in.}$$

$$Q_3 = 72 + \frac{6}{35} \times 1 = 72.171 \text{ or } 72.17 \text{ in.}$$
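
The interpolation used for the median and the quartiles can be sketched as a single routine. The function name `interpolated_quantile` is an illustrative assumption; it presumes, as the text does, that cases are spread evenly within each class interval.

```python
# Lower limits and frequencies of the class intervals (Table 16).
LOWER = list(range(62, 78))
F = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]
I = 1   # class interval in inches

def interpolated_quantile(p):
    """Value below which a fraction p of the cases lies, by linear
    interpolation inside the class interval that contains it."""
    target = p * sum(F)          # e.g. N/2 for the median
    counted = 0                  # cumulative frequency below this interval
    for lower, f in zip(LOWER, F):
        if f > 0 and counted + f >= target:
            return lower + (target - counted) / f * I
        counted += f
    raise ValueError("p must lie between 0 and 1")

q1, median, q3 = (interpolated_quantile(p) for p in (0.25, 0.50, 0.75))
print(round(q1, 3), median, round(q3, 3))   # 68.774 70.4375 72.171
```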


The Average Deviation. The average, or mean, deviation 
is a measure of dispersion that has its minimum value when 
deviations are measured from the median. To compute the 
average deviation from the median, subtract each of the N values 
of X from the median, add the absolute values of the deviations, 
and divide by N. Thus, 


$$A.D. = \frac{\sum |X - Mi|}{N} \qquad (6)$$


The average deviation is simpler in concept than any other 
measure of dispersion. It is less affected by extreme deviations 
than the more popular standard deviation, and for this reason 
it probably has greater sampling reliability from extremely 
leptokurtic universes. In spite of these advantages the average 
deviation is not a popular measure of dispersion, partly because 
of several widely accepted but mistaken notions concerning its 
properties. 

It is often said that it is illogical to neglect the signs of devia- 
tions to be averaged and that this fallacy is avoided in the case 
of other measures of dispersion. It is true that the mean devia- 
tion from the median is the mean of absolute deviations from 
some average, but every other measure of dispersion is also equal 
(or proportional) to an average of absolute deviations from some 
average. The quartile deviation is the median of absolute
deviations from the mid-quartile, and the standard deviation 
is the quadratic mean of absolute deviations from the mean. 

It has been said that the sampling reliability of the average 
deviation is less than that of the standard deviation. This may 
be true for normal universes, but it can hardly be true for all 
types. 

Grouped Data: Mid-value Assumption in Calculating Average
Deviation. When data are grouped in the form of a frequency
distribution with equal class intervals, the average deviation can
be written in the simple form

$$A.D. = \frac{i \sum |F(d/i)|}{N} \qquad (7)$$

where d is the deviation of class mid-values from the mid-value 
of the class interval containing the median. Although Eq. (7) 
is the exact value of the average deviation from the median 
according to the assumption that all observations in every class 
interval are equal to the mid-value of the interval (the same 
assumption commonly used for the standard deviation), many 
statisticians consider it unsatisfactory as a formula for the 
average deviation. The chief reason for the dissatisfaction seems 
to be that the mid-value assumption, which implies that the 
median is the mid-value of the median interval, is inconsistent 
with the ordinary notion of the interpolated median. 

In applying the simple formula in practice, several corrections 
may be used, some of which will be illustrated below. Each of 
these corrections deals with a separate aspect of the problem of 
approximating the average deviation of ungrouped data from a 
frequency distribution. The two most important corrections 
are usually of the same order of magnitude, but opposite in sign, 
so that they tend to offset each other. For this reason, it is 
usually advisable to use the simpler formula without correction, 
because of its simplicity, unless the problem is of great importance 
so that minor adjustments are worth making. 

Grouped Data: Histogram Assumption in Calculating Average
Deviation. The average deviation of the histogram considered
as a continuous frequency function is often used in preference 
to the simple formula for the average deviation presented in Eq. 
(7). This corresponds to the assumption on which the usual 
interpolated median is based. The median is the abscissa of 
the vertical line that divides the histogram into two equal 
areas. When the left half of the histogram is folded along this 
vertical line, over the half on the right, the average deviation 
is the first moment about the line of folding. 

To simplify the derivation, let d/i represent deviations from 
the mid-value of the median interval, and let 


$$Mi = L + ci \qquad (8)$$

where L is the lower limit of the median interval, i is the width
of the class interval, and c is the proportion of observations in
the median interval that are less than Mi. It is to be noted
that the cases are assumed to be distributed uniformly through
the interval.

In these terms the formula for average deviation can be written
as follows:

$$A.D. = \frac{i \sum |F(d/i)| + C_1 + C_2}{N} = \frac{i\left[\sum |F(d/i)| + c(1 - c)F_0\right]}{N} \qquad (9)$$

in which $F_0$ is the frequency of the median interval, $C_1$ is the
amount of correction associated with observations above and
below the median interval, and $C_2$ is the amount of correction
associated with the median interval itself.

Fig. 74. Illustration of distribution of cases in and above and below the median
interval.

To demonstrate the truth of this equation, consider the
diagram of the median interval shown in Fig. 74. Since
deviations from the mid-value of the median interval are $(\tfrac{1}{2} - c)i$
too small for observations above the median interval and $(\tfrac{1}{2} - c)i$
too large for those below the median interval, it follows that

$$C_1 = i\left(\tfrac{1}{2} - c\right)\left[\left(\tfrac{N}{2} - (1 - c)F_0\right) - \left(\tfrac{N}{2} - cF_0\right)\right]$$
$$\quad = i\left(\tfrac{1}{2} - c\right)(2c - 1)F_0 = iF_0\left(2c - 2c^2 - \tfrac{1}{2}\right) \qquad (10)$$

The area in the median interval below Mi is $cF_0$, and its mean
deviation from Mi is ci/2. Similarly, $(1 - c)F_0$ lies above Mi
with a mean deviation of (1 - c)i/2. Hence,

$$C_2 = cF_0\left(\frac{ci}{2}\right) + (1 - c)F_0\,\frac{(1 - c)i}{2} = iF_0\left(c^2 - c + \tfrac{1}{2}\right) \qquad (11)$$




From Eqs. (10) and (11), the combined corrections are found to be

$$C_1 + C_2 = iF_0\left(2c - 2c^2 - \tfrac{1}{2}\right) + iF_0\left(c^2 - c + \tfrac{1}{2}\right) = iF_0(c - c^2) = iF_0\,c(1 - c) \qquad (12)$$

a result that verifies Eq. (9). Equation (9) is probably the most
convenient form available for computing the mean deviation
according to the histogram assumption.

Calculation of the average deviation by the use of Eq. (9) 
is illustrated by Table 17 and the ensuing analysis. 


Table 17. Frequency Distribution of the Heights of 300 Princeton
Freshmen, Class of 1943
(In inches. Class interval 1 in.)

    X          F       d/i      F(d/i)

    62-        1       -8         -8
    63-        2       -7        -14
    64-        1       -6         -6
    65-        4       -5        -20
    66-       12       -4        -48
    67-       31       -3        -93
    68-       31       -2        -62
    69-       47       -1        -47
    70-       48        0          0
    71-       42        1         42
    72-       35        2         70
    73-       21        3         63
    74-       14        4         56
    75-        8        5         40
    76-        2        6         12
    77-        1        7          7

    Total    300                -298
                                +290

    Σ (without regard to sign) = 588


When the median and the quartiles were calculated, it was
assumed that the frequencies were evenly distributed in the class
intervals. This assumption is continued while calculating the
average deviation about the median. As shown in Table 17,
the sum of the deviations about the arbitrary origin, without
regard to sign, is 588. That is,

$$\sum |F(d/i)| = 588$$

where $d_1 = X_1 - A$, $d_2 = X_2 - A$, ..., $d_n = X_n - A$.

The sum desired is the sum without regard to sign of deviations
from the median. That is,

$$\sum |F(x'/i)|$$

where $x_1' = X_1 - Mi$, $x_2' = X_2 - Mi$, ..., $x_n' = X_n - Mi$.

Note: x has been used to symbolize the deviations from the arithmetic
mean; x' is used to symbolize deviations from the median.

Accordingly, the above sum, 588, which for the illustration
chosen is $\sum |F(d/i)|$, can be adjusted by a calculated correction
that will change the sum to $\sum |F(x'/i)|$. This correction is
obtained by using Eq. (9).

From Table 17 and the analysis on pages 221 to 223 it is to be
noted that $F_0$ = 48, the frequency of the interval containing
the median; i = 1; and c = 0.44, since the median is 70.44, the
lower limit of the interval containing the median is 70, and c is
the proportion of observations in the median interval that are
less than the median. Accordingly, the average deviation may
be calculated by using Eq. (9), as follows:

$$A.D. = \frac{i\left[\sum |F(d/i)| + c(1 - c)F_0\right]}{N} = \frac{1[588 + 0.44(0.56)48]}{300} = \frac{588 + 11.83}{300} = 2.00 \text{ in.}$$
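
As a check on Eq. (9), the sketch below computes the average deviation two ways: once by the formula, and once by summing deviations from the interpolated median directly under the histogram assumption. It uses the unrounded c = 21/48 rather than 0.44, so the correction term comes out slightly different from the 11.83 in the text; the variable names are illustrative.

```python
# Lower limits and frequencies of the class intervals (Table 17).
LOWER = list(range(62, 78))
F = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]
i = 1
N = sum(F)

L, F0 = 70, 48                                        # median interval
cases_below = sum(f for lo, f in zip(LOWER, F) if lo < L)
c = (N / 2 - cases_below) / F0                        # 21/48
median = L + c * i
mid0 = L + i / 2                                      # mid-value of median interval

# Eq. (7) sum: deviations of mid-values from mid0, in class-interval units.
abs_sum = sum(f * abs((lo + i / 2 - mid0) / i) for lo, f in zip(LOWER, F))

# Eq. (9): the mid-value sum plus the combined correction c(1 - c)F0.
ad_eq9 = i * (abs_sum + c * (1 - c) * F0) / N

# Direct computation under the histogram assumption.
total = 0.0
for lo, f in zip(LOWER, F):
    if lo == L:
        # Cases below and above the median inside the median interval.
        total += F0 * (c * (c * i / 2) + (1 - c) * ((1 - c) * i / 2))
    else:
        total += f * abs(lo + i / 2 - median)
ad_direct = total / N

print(abs_sum, round(ad_eq9, 4), round(ad_direct, 4))
```

The two results coincide, which is the content of Eqs. (10) to (12).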


The Semiquartile Range. The semiquartile range, one-half
of the difference between the third quartile and the first quartile,
is another statistic that measures variability. Its formula is

$$Q = \frac{Q_3 - Q_1}{2}$$

For the frequency distribution in Table 16, the semiquartile
range is calculated as follows:

$$Q = \frac{72.17143 - 68.77419}{2} = 1.69862 \text{ or } 1.70 \text{ in.}$$

Measures of Skewness. From statistics measuring variation
and central tendencies, important measures of skewness are
obtained. It has been noted that $\bar{X} - \text{Mo}$ is a measure of
skewness. In the frequency distribution of 300 Princeton
freshmen heights,

$$\bar{X} - \text{Mo} = 70.473333 - 70.36584 = 0.10749 \text{ or } 0.11 \text{ in.}$$

The position of the first and third quartiles in relation to the
median is a very convenient statistic measuring skewness,
namely,

$$(Q_3 - \text{Mi}) - (\text{Mi} - Q_1) \quad \text{or} \quad Q_1 + Q_3 - 2\,\text{Mi}$$

For the frequency distribution of heights of 300 Princeton 
freshmen this statistic is 

68.7742 + 72.1714 - 2(70.4375) = 0.07 in. 
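As a quick check of the two quartile statistics just computed, the following sketch repeats the arithmetic. The variable names are illustrative; the quartile and median values are the ones quoted in the text.

```python
# Quartiles and median quoted in the text (inches).
q1 = 68.77419
q3 = 72.17143
median = 70.4375

# Semiquartile range: half the distance between the quartiles.
semiquartile_range = (q3 - q1) / 2

# Quartile measure of skewness: (Q3 - Mi) - (Mi - Q1).
quartile_skewness = q1 + q3 - 2 * median

print(round(semiquartile_range, 5))  # 1.69862
print(round(quartile_skewness, 2))   # 0.07
```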

COEFFICIENTS OF VARIABILITY 

The various aggregative measures of variability may con- 
veniently be expressed as relatives or coefficients, as explained 
in the preceding chapter; indeed, they must be so expressed if 
comparisons are to be made with other frequency distributions 
having different types of units. The aggregative measures of 
variability are converted into relatives or coefficients by dividing 
the former by the mean, the median, or the average of the two 
quartiles. For the present problem, the relative measures of 
variability that would be useful in comparing this frequency 
distribution with other frequency distributions are as follows:

$$V_\sigma = \frac{\sigma}{\bar{X}} = 3.53 \text{ per cent}$$

$$V_{A.D.} = \frac{\text{A.D.}}{\text{Mi}} = 2.84 \text{ per cent}$$

$$V_Q = \frac{Q_3 - Q_1}{Q_3 + Q_1} = 2.41 \text{ per cent}$$

The formula for $V_Q$ is really the semiquartile range divided
by the average of the two quartiles, but the 2’s cancel out, leaving 
merely the difference between the two quartiles in the numerator 
and their sum in the denominator. 
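The cancellation of the 2's can be confirmed numerically. This is only a sketch using the mean, standard deviation, and quartiles quoted elsewhere in this chapter; the variable names are illustrative.

```python
# Statistics quoted in the text for the 300 heights (inches).
mean = 70.473333
sigma = 2.48716
q1 = 68.77419
q3 = 72.17143

# Coefficient of variability based on the standard deviation, in per cent.
v_sigma = 100 * sigma / mean

# Quartile coefficient written two ways: the short form, and the
# semiquartile range divided by the average of the two quartiles.
v_q_short = 100 * (q3 - q1) / (q3 + q1)
v_q_long = 100 * ((q3 - q1) / 2) / ((q3 + q1) / 2)

print(round(v_sigma, 2))    # 3.53
print(round(v_q_short, 2))  # 2.41
```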


COEFFICIENTS OF SKEWNESS 

Statistics measuring skewness are likewise more significant 
for comparative purposes when expressed as coefficients. The 
various coefficients of skewness for the frequency distribution 
in Table 16 are as follows: 

Based on mathematical statistics: 


$$sk = \frac{-0.32764 \text{ in.}}{2.48716 \text{ in.}} = -0.1317 \text{ or } -13.17 \text{ per cent}$$

^ = 2(^ or -1.12 per cent 

Note: This is given the negative sign because the third moment is
negative. 

Based on other statistics (using Mo = 70.488):

$$sk = \frac{\bar{X} - \text{Mo}}{\sigma} = \frac{-0.015 \text{ in.}}{2.487 \text{ in.}} = -0.006 \text{ or } -0.6 \text{ per cent}$$

(If the so-called “mathematical mode,” i.e., Mo = 70.366, is used, this
coefficient of skewness by the same formula would be +4.32 per cent.)

($sk = \frac{\bar{X} - \text{Mo}}{\sigma} = \frac{+0.00477 \text{ in.}}{2.48716 \text{ in.}} = +0.00191$, or +0.19 per cent.)

Using the median and the two quartiles to measure skewness,
the following result is obtained:

$$sk = \frac{Q_3 + Q_1 - 2\,\text{Mi}}{Q} = \frac{+0.0706 \text{ in.}}{1.69862 \text{ in.}} = +0.0416, \text{ or } +4.16 \text{ per cent}$$
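The mode-based and quartile-based coefficients can be reproduced with a brief sketch. The function names are illustrative, and the inputs are the figures quoted in this chapter.

```python
def mode_skewness(mean, mode, sigma):
    """Coefficient of skewness (mean - mode) / sigma."""
    return (mean - mode) / sigma


def quartile_skewness(q1, q3, median):
    """Coefficient (Q3 + Q1 - 2 Mi) / Q, where Q is the semiquartile range."""
    semiquartile_range = (q3 - q1) / 2
    return (q3 + q1 - 2 * median) / semiquartile_range


# Using the interpolated ("mathematical") mode 70.36584.
sk_mode = mode_skewness(70.473333, 70.36584, 2.48716)
# Using the quartiles and the median.
sk_quartile = quartile_skewness(68.77419, 72.17143, 70.4375)

print(round(100 * sk_mode, 2))      # 4.32 (per cent)
print(round(100 * sk_quartile, 2))  # 4.16 (per cent)
```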


The difficulty of locating the mode, even when quite a large 
sample is taken, is illustrated by the frequency distribution 
analyzed in this chapter. In this illustration every mathematical
indication is that the mode is larger than the mean, but the
nonmathematically calculated mode (the interpolated mode) is
smaller than the mean. 

GRAPHIC INTERPRETATION OF STATISTICS OF VARIABILITY
AND SKEWNESS

Figure 75 shows on a scale the relative location of the median, 
the two quartiles, and the upper and lower limits found by taking 
Mi ± A.D.Mi, namely, 70.44 ± 2.00. The figure illustrates the 
fact that, when there is skewness, the location of the quartiles 
with reference to the median reflects the presence of skewness. 
If, therefore, the quartiles are used as measures of deviation, 
they reflect the fact that, in skewed distributions, the deviation
is skewed in one direction or the other. If the average deviation
is used as a measure of deviation or variability, the presence of
skewness will not be noted in the results. Whenever the
distribution is skewed to any extent, the quartiles are unequal
distances from the median, as may be noticed in Fig. 75. As the
figure also illustrates, the average deviation is conceived as an
equal distance on each side of the median.

Fig. 75. Illustration of significance of average deviation and two quartiles as
measures of dispersion.

Figure 76 shows on a scale the relative location of the median 
and average deviation and the location of the mean and the 
standard deviation by depicting the upper and lower limits of 
Mi ± A.D.Mi (as in Fig. 75) and, in addition, the upper and 
lower limits of $\bar{X} \pm \sigma$. As in the case of the average deviation,
so also in the case of the standard deviation, the measure of
variability is conceived as an equal distance above and below
the mean; that is, an equal distance from the mean on the
x-axis in both the positive and the negative directions in Figs.
75 and 76. If the distribution is skewed to a marked extent, it
should be evident that care must be exercised in interpreting the
significance of the average deviation or the standard deviation.

From Fig. 76 it is noted that the first quartile in the negative
direction and the third quartile in the positive direction are less
distant from the median than ±A.D.Mi. By definition, the
limits of the range between the first and third quartiles include
exactly 50 per cent of the cases. For a normal distribution¹ the
distance between the upper and lower limits defined by
Mi ± A.D.Mi includes approximately 58 per cent of the cases.²



Fig. 76. Illustration of the standard deviation and average deviation as
measures of variability.

It will be noted from Fig. 76 that the limits $\bar{X} \pm \sigma$ are farther,
respectively, in the positive and negative directions from the
mean than are the limits Mi ± A.D.Mi from the median. The
standard deviation is always larger than the average deviation;
in fact, an approximate check³ on the accuracy of calculation may
be used as follows: A.D. = 0.8σ. In the frequency distribution
illustrated, this check works fairly well; for 0.8(2.49) = 1.99 and
the calculated A.D.Mi = 2.00. For a normal distribution the
distance between the upper and lower limits defined by $\bar{X} \pm \sigma$
includes approximately two-thirds (68 per cent) of the cases.²
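The percentages quoted for the normal distribution, and the 0.8 factor in the check, can be illustrated with the standard library. This sketch assumes an exactly normal distribution, for which the average deviation equals σ times the square root of 2/π; the function name is an illustrative choice.

```python
import math


def normal_coverage(k):
    """Proportion of a normal distribution lying within k standard
    deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# For a normal distribution, A.D. / sigma = sqrt(2 / pi), roughly 0.8.
ad_over_sigma = math.sqrt(2 / math.pi)

print(round(ad_over_sigma, 4))                   # 0.7979
print(round(normal_coverage(1.0), 4))            # 0.6827
print(round(normal_coverage(ad_over_sigma), 3))  # 0.575
```

The last figure is the proportion of cases within one average deviation of the center, which the text rounds to 58 per cent.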

¹ See Chap. XI for description of a normal distribution.

² For more precise discussion and explanation, see Chaps. XI and XII.

³ For distributions that depart widely from the normal form, this check
may not be satisfactory.

FREQUENCY DISTRIBUTIONS WITH UNEQUAL CLASS INTERVALS

As remarked earlier in this chapter, the size of the class interval
should ordinarily be uniform throughout a given frequency
distribution; but in some cases, usually because there is a large
concentration of cases at one or the other extreme of the range, 
it is considered necessary to use different-sized class intervals for
parts of the frequency distribution in order to give a proper 
picture of variability. Table 18 illustrates such an instance. 
Of 150 cases distributed over the range 0-51, 106 cases fell within 
the limits 0-10. Obviously, a small number of class intervals 
of uniform size would give a wholly erroneous notion of the 
variation. Occasionally, data at its primary source will be 
published in a manner similar to that of Table 18, and the 
statistician has no choice but to utilize the material in frequency 
distributions that have unequal-sized class intervals. This is 
particularly true of statistics of wages and income and statistics 
of hours of labor. 


Table 18. Deaths Due to Automobile Accidents in 150 Cities,*
First 20 Weeks of 1940

  Number of deaths due to     Number of cities       Calculations of
  automobile accidents, X     whose automobile       deviations from an
  -------------------------   accident fatalities    arbitrary origin
  Intervals     Mid-values    were as specified, F   (A = 15)†
  ------------------------------------------------------------------------
   0 -              0.5               11                  -1.45
   1 -              1.5               23                  -1.35
   3 -              4.0               34                  -1.10
   5 -              7.5               38                  -0.75
  10 -             15.0               24                   0
  20 -             25.0               12                   1.00
  30 -             35.0                4                   2.00
  40-51            45.5                4                   3.05
  ------------------------------------------------------------------------
                                     150

* New York, Los Angeles, Chicago, and Detroit are excluded from these statistics.
United States Bureau of the Census, Weekly Accident Bulletin, May 24, 1940,
pp. 1-4.
† i = 10.

If the mid-value of the class interval 10- is taken as the
arbitrary origin, that is, A = 15, and the “class interval” or
abscissa scale unit i is taken equal to 10 (since that size interval
predominates), the deviations of class intervals in that part of the
frequency distribution where class intervals are equal are readily
determined. Where the class intervals are unequal, simple
subtraction of mid-values and the division of the answer by the
scale unit gives the results in the last column of Table 18. To
illustrate the process, there is a difference of 10.5 between
mid-value 35 and mid-value 45.5, a quantity 1.05 times the scale
unit. Accordingly, the deviation advances from 2.0 to 3.05
scale units. In the lower reaches of the range there is a difference
of 7.5 between mid-value 15 and mid-value 7.5, or 3/4 of a scale unit;
consequently, the step-deviation change is from 0 to -0.75.
From mid-value 7.5 to mid-value 4, the deviation recedes 0.35
of a scale unit to -1.10. From mid-value 4.0 to mid-value 1.5, a
distance of 1/4 of an interval, the scale-unit deviation changes from
-1.10 to -1.35. Finally, from mid-value 1.5 to mid-value 0.5,
a distance of 1/10 of a scale unit, the scale-unit deviation recedes
from -1.35 to -1.45.
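The step deviations in the last column of Table 18 follow from one rule: subtract the arbitrary origin from each mid-value and divide by the scale unit. A minimal sketch, with illustrative variable names:

```python
# Mid-values of the class intervals in Table 18.
mid_values = [0.5, 1.5, 4.0, 7.5, 15.0, 25.0, 35.0, 45.5]

origin = 15.0      # arbitrary origin A
scale_unit = 10.0  # abscissa scale unit i

# Deviation of each mid-value from the origin, in scale units.
step_deviations = [round((m - origin) / scale_unit, 2) for m in mid_values]

print(step_deviations)
# [-1.45, -1.35, -1.1, -0.75, 0.0, 1.0, 2.0, 3.05]
```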

From this point on, the analysis of the frequency distribution 
is the same as it would be were uniform class intervals used, 
although obviously the uneven numbers add somewhat to the 
burden of filling in the work sheet according to the plan shown in 
Table 16. Once the work sheet has been completed, however, 
the fact that the class intervals are not uniform ceases to be a 
consideration in the subsequent computations; the summation 
figures can be applied in the formulas in precisely the same 
manner as if the class intervals were uniform. 
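To illustrate that the unequal intervals no longer matter once the step deviations are in hand, the sketch below computes the mean of Table 18 twice: directly from the mid-values, and through the step deviations. The short-cut formula used, mean = A + i(ΣFd')/N, is the usual one and is assumed here to correspond to the work-sheet formulas referred to; the variable names are illustrative.

```python
mid_values = [0.5, 1.5, 4.0, 7.5, 15.0, 25.0, 35.0, 45.5]
frequencies = [11, 23, 34, 38, 24, 12, 4, 4]

origin = 15.0      # arbitrary origin A
scale_unit = 10.0  # scale unit i
n = sum(frequencies)

# Short-cut method: sum of F times the step deviation.
sum_f_step = sum(f * (m - origin) / scale_unit
                 for f, m in zip(frequencies, mid_values))
mean_shortcut = origin + scale_unit * sum_f_step / n

# Direct method: weighted average of the mid-values.
mean_direct = sum(f * m for f, m in zip(frequencies, mid_values)) / n

print(n, round(mean_shortcut, 2), round(mean_direct, 2))  # 150 9.62 9.62
```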

ACCURACY IN THE CALCULATION OF STATISTICS 

Ordinary common sense would dictate that all recording of 
figures needs to be carefully checked, since there is always a 
chance of making a mistake in copying. Such mistakes are not 
statistical errors to be disregarded under the “theory of errors,”
which is explained in Chap. XI. They cannot be disregarded,
and every effort should be made to prevent their occurrence. 
The same applies to all calculations made, but frequently 
short-cut or cross checks can be devised for these. While 
accuracy is essential, a spurious accuracy may be introduced into 
final answers. For example, in most cases final figures repre- 
senting samples should be presented in round numbers, including 
only the significant figures in the arithmetical answers obtained.¹

¹ The meaning of “significant” is explained on p. 213, note.

Care must be taken, however, in cases where errors are likely to
accumulate through successive steps of calculation. It may be
necessary to retain the figures in a calculated result for a number
of places beyond the significant figures if that calculated result
is being used in the process of calculating other statistics. In
some statistical problems it is necessary to add a constant 
successively perhaps fifty or even hundreds of times, or, similarly, 
to multiply by a constant successively a large number of times. 
In such instances the constant should be written to several more 
places than will be used in the final answers in order to avoid 
an error in significant figures at the end of the process. This is a 
purely mathematical problem; in every case, the standard of 
accuracy required, or the number of significant figures, having 
been decided upon, a simple arithmetical calculation will show to 
how many places the intermediary calculations must be carried. 
The final results are then rounded off to the number of significant 
figures. 
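The effect of adding a rounded constant many times can be seen in a small sketch. The constant 1/7 and the count of 100 additions are hypothetical values chosen only for illustration; exact fractions are used so that the only error shown is the one introduced by rounding the constant.

```python
from fractions import Fraction


def accumulate(constant, places, times):
    """Round the constant to `places` decimals, then add it `times` times."""
    rounded = round(constant, places)  # a Fraction rounded to `places` decimals
    return rounded * times


exact_constant = Fraction(1, 7)
exact_total = exact_constant * 100

coarse_total = accumulate(exact_constant, 2, 100)  # constant written as 0.14
fine_total = accumulate(exact_constant, 5, 100)    # constant written as 0.14286

print(float(round(exact_total, 2)))  # 14.29
print(float(coarse_total))           # 14.0
print(float(round(fine_total, 2)))   # 14.29
```

Carrying the constant to five places keeps the final two-decimal answer correct, while carrying it to only two places does not.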

In rounding numbers the rule is that a remainder less than 
half a unit is disregarded, while half or more than half is counted 
as an additional unit. Exactly half may be changed to the 
nearest even number — thus 174.5 would be 174 but 175.5 would 
be 176. 
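The even-number rule for exact halves is available in Python's `decimal` module as `ROUND_HALF_EVEN`. The helper name below is an illustrative choice.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_half_even(value):
    """Round a decimal string to the nearest unit; exact halves go to
    the nearest even number."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))


print(round_half_even("174.5"))  # 174
print(round_half_even("175.5"))  # 176
print(round_half_even("174.4"))  # 174
print(round_half_even("174.6"))  # 175
```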



PART III 


The Normal Frequency Curve 

CHAPTER VIII
PROBABILITY 

Up to this point, the discussion has primarily been concerned 
with “descriptive statistics.” Attention has centered upon
methods of summarizing and describing statistical variation. 
Occasionally, theory has been employed to explain certain 
methods or to indicate why one method is to be preferred to 
another; but, in general, emphasis has been upon the facts as 
such, rather than upon any theoretical explanation of or inference 
to be made from these facts. 

In contrast, the next four chapters will be primarily concerned
with a particular body of theoretical statistics, namely, the theory
of the normal frequency curve. The question now to be
considered is not “what” is the character of a given frequency
distribution, but “why.” The discussion will be abstract and
general and will not pertain to actual concrete data, except by
way of illustration.

Before this theoretical analysis can be undertaken, however, 
certain mathematical tools must be acquired and certain funda- 
mental concepts clarified. That is the purpose of this and the 
next chapters. 

PERMUTATIONS AND COMBINATIONS 

Permutations Defined and Illustrated. A “permutation” is
an arrangement. The word “man,” for example, is a special
arrangement of the three letters m, a, and n. Other possible
arrangements of these three letters are: mna, nma, nam, anm,
and amn. All these arrangements are permutations.

In general, if there are N different things, it is possible to form
N! different permutations.¹ Consider again the three letters
m, a, and n. In making various arrangements of these, it is
possible to pick the first letter in three different ways. The 
first letter having been picked, there are then left two different 
ways for the selection of the second letter. Finally, the first 
two letters having been selected, there remains one, and only one, 
way for the selection of the last letter. Now, each one of the two 
ways that are open for the selection of the second letter can be 
combined with each of the three ways that are open for the 
selection of the first letter, so that there are 3 X 2 different ways
of picking the first and second letters. Since there is only one 
way left in every case for the selection of the last letter, there are 
therefore 3 X 2 X 1 = 6 different ways of picking all the three 
letters. Thus, the number of different permutations of three 
things is 3! = 6. If there had been 10 different letters, the 
number of different permutations of these would have been 
10 X 9 X 8 X 7 X 6 X 5 X 4 X 3 X 2 X 1 = 10! = 3,628,800.
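The six arrangements of m, a, and n, and the count for ten letters, can be listed with the standard library. This is a sketch for checking the counts, not part of the original exposition.

```python
from itertools import permutations
from math import factorial

# Every arrangement of the three letters m, a, n.
arrangements = ["".join(p) for p in permutations("man")]

print(arrangements)   # ['man', 'mna', 'amn', 'anm', 'nma', 'nam']
print(factorial(3))   # 6
print(factorial(10))  # 3628800
```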

Suppose, now, that among 10 different things 3 are to be 
selected for some particular purpose, the exact nature of the 
purpose being immaterial for the analysis. The question is: 
In how many different ways may a subgroup of 3 be selected 
from the total of 10; in other words, what is the number of differ- 
ent permutations that can be made of 10 things taken 3 at a 
time? This question may be answered as follows: It is possible 
to select the first of the subgroup of 3 in 10 different ways, the 
second in 9 different ways, and the third in 8 different ways. 
There are thus altogether 10 X 9 X 8 different ways in which 
the 3 things may be selected from the total of 10. Accordingly, 
the number of different permutations of 10 things taken 3 at a 
time is 10 X 9 X 8 = 720. In general, the number of different 
permutations of N things taken r at a time is 

$$P_r^N = N(N - 1) \cdots \text{ to } r \text{ factors}$$

that is,

$$P_r^N = N(N - 1)(N - 2) \cdots (N - r + 1) \qquad (1)$$
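Equation (1) is a product of r descending factors, which the sketch below writes out and compares with `math.perm` (available in Python 3.8 and later). The function name is an illustrative choice.

```python
import math


def permutations_count(n, r):
    """Number of permutations of n things taken r at a time:
    n (n - 1) ... (n - r + 1), a product of r factors."""
    result = 1
    for factor in range(n, n - r, -1):
        result *= factor
    return result


print(permutations_count(10, 3))  # 720
print(math.perm(10, 3))           # 720
```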

Combinations Defined and Illustrated. A “combination” is
not the same thing as a permutation. A group of 3 letters
constitutes a combination of the 3 letters; but as has just been
seen, this combination can be arranged in 3! different ways. In
other words, it is possible to have 3! permutations of a single
combination of 3 things. In general, it is possible to have N!
permutations of a single combination of N things.

¹ N! is to be read “N factorial” and signifies the successive product of N
by all the integers less than N and greater than zero.

Although a group of N things forms but a single combination,
subgroups may be picked in such a way as to constitute different 
combinations. Suppose, for example, that the board of directors 
of a given corporation consists of 10 men and the chairman. 
The chairman wishes to pick a committee of 3 men. In how 
many different ways can such a committee be constituted, the 
chairman himself being excluded? This is a question of how 
many different combinations of 3 men may be taken from a 
group of 10 men. It will be noted that the order of selection 
is immaterial, for it is only the constituency of any committee 
that differentiates it from other possible committees. 

The answer to this question is obtained as follows: Let $C_3^{10}$
represent the number of combinations to be calculated, viz., the
number of different combinations of 10 things taken 3 at a time.
Each one of these combinations, it will be recalled, can be
arranged in 3! different ways; i.e., there are 3! different ways in
which a given committee can be selected. Accordingly, the
total number of ways in which a committee of 3, i.e., just any
committee and not a particular committee, can be chosen is
equal to $C_3^{10} \times 3!$. But the total number of different ways in
which a committee of 3 can be picked from a group of 10 is the
number of permutations of 10 things taken 3 at a time, which is
equal to 10 X 9 X 8. Therefore, $C_3^{10} \times 3! = 10 \times 9 \times 8$, and

$$C_3^{10} = \frac{10 \times 9 \times 8}{3!} = 120$$

In general, the number of different combinations of N things taken r at a time is

$$C_r^N = \frac{N(N - 1) \cdots (N - r + 1)}{r!} \qquad (2)$$

or, if numerator and denominator are both multiplied by $(N - r)!$,

$$C_r^N = \frac{N!}{r!\,(N - r)!} \qquad (3)$$
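The committee example can be verified in a few lines. The sketch relies on `math.comb` and `math.perm` (Python 3.8 and later) and checks the relation between permutations and combinations used in the argument above.

```python
from math import comb, factorial, perm

committees = comb(10, 3)          # combinations of 10 men taken 3 at a time
ordered_selections = perm(10, 3)  # permutations of 10 men taken 3 at a time

print(committees)                                # 120
print(ordered_selections)                        # 720
print(committees * factorial(3))                 # 720
print(factorial(10) // (factorial(3) * factorial(7)))  # 120
```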

The Binomial Expansion. A use that is made of combinatorial
theory in elementary algebra is to find a formula for the
expansion of the binomial $(x + y)^N$. It will be recalled that
$(x + y)^2 = (x + y)(x + y)$ is found by multiplying each term
of the first factor by each term of the second factor and adding
these partial products. Thus

$$(x + y)^2 = x^2 + xy + xy + y^2 = x^2 + 2xy + y^2$$

A higher powered binomial can be evaluated by mere repetition
of this process. Thus

$$(x + y)^3 = (x + y)(x + y)(x + y) = (x^2 + 2xy + y^2)(x + y)$$

$$= x^3 + 2x^2y + xy^2 + x^2y + 2xy^2 + y^3 = x^3 + 3x^2y + 3xy^2 + y^3$$

It will be noted that the result in each case consists of a series of
terms in diminishing powers of x (or rising powers of y), and this
is generally true no matter what the power of the binomial.
It will also be noted that the number of times a given term
occurs (i.e., its coefficient in the expansion) depends on the
number of ways the x's (or y's) that make up that term can be
selected from the different factors. Thus in the case of $(x + y)^3$
the term composed of three x's, that is, $x^3$, can be formed in only
one way, namely, by taking an x from each of the three factors.
The term $x^2y$, however, which contains two x's, can be formed
in three ways. This is because the number of different
combinations that can be made of three x's taken two at a time is
$\frac{3 \cdot 2 \cdot 1}{2 \cdot 1 \cdot 1} = 3$. Similarly, the coefficient of $xy^2$ is the number of
different combinations of three x's taken one at a time, which is
$\frac{3 \cdot 2 \cdot 1}{1 \cdot 2 \cdot 1} = 3$. Accordingly, the expansion of $(x + y)^3$ might
be written $(x + y)^3 = C_3^3 x^3 + C_2^3 x^2y + C_1^3 xy^2 + C_0^3 y^3$, where $C_3^3$
means the number of combinations of three things taken three
at a time, $C_2^3$ equals the number of combinations of three things
taken two at a time, etc.,¹ the evaluation of these quantities to be
determined by Eq. (3). If consideration were given to the
power of y instead of x, this new method of writing the expansion
of $(x + y)^3$ would become

$$(x + y)^3 = C_0^3 x^3 + C_1^3 x^2y + C_2^3 xy^2 + C_3^3 y^3$$

¹ Note that 0! is taken by convention to be 1, so that $C_3^3 = 1$.

In general,

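$$(x + y)^N = C_N^N x^N + C_{N-1}^N x^{N-1}y + \cdots + C_r^N x^r y^{N-r} + \cdots + C_1^N x y^{N-1} + C_0^N y^N \qquad (4)$$

or, on using the second method of expression,

$$(x + y)^N = C_0^N x^N + C_1^N x^{N-1}y + \cdots + C_r^N x^{N-r} y^r + \cdots + C_{N-1}^N x y^{N-1} + C_N^N y^N \qquad (4a)$$

Thus

$$(x + y)^4 = C_0^4 x^4 + C_1^4 x^3y + C_2^4 x^2y^2 + C_3^4 xy^3 + C_4^4 y^4$$

$$= x^4 + 4x^3y + 6x^2y^2 + 4xy^3 + y^4$$

and

$$(x + y)^5 = C_0^5 x^5 + C_1^5 x^4y + C_2^5 x^3y^2 + C_3^5 x^2y^3 + C_4^5 xy^4 + C_5^5 y^5$$

$$= x^5 + 5x^4y + 10x^3y^2 + 10x^2y^3 + 5xy^4 + y^5$$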

It is in this way that the combinatorial formulas enter into the 
binomial expansion. Later it will be seen that a certain
frequency distribution is called a “binomial distribution” because
its relative frequencies are computed in the same way as the 
coefficients of the terms of a binomial expansion. 
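The coefficients of the fourth and fifth powers can be generated from Eq. (3), and the expansion checked numerically for particular values of x and y. This sketch follows the second method of expression, Eq. (4a); the function names and the test values x = 2, y = 3 are illustrative.

```python
from math import comb


def binomial_coefficients(n):
    """Coefficients of (x + y)^n in order of rising powers of y."""
    return [comb(n, r) for r in range(n + 1)]


def expand_numerically(x, y, n):
    """Evaluate the binomial expansion term by term, as in Eq. (4a)."""
    return sum(comb(n, r) * x ** (n - r) * y ** r for r in range(n + 1))


print(binomial_coefficients(4))     # [1, 4, 6, 4, 1]
print(binomial_coefficients(5))     # [1, 5, 10, 10, 5, 1]
print(expand_numerically(2, 3, 5))  # 3125, which is (2 + 3)^5
```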

MATHEMATICAL PROBABILITY 

The concept of probability has been the subject of much 
debate among philosophers, mathematicians, and statisticians. 
To enter into this debate, however, would be beyond the scope 
of this book.¹ Although the concept of probability presented
below appears to be the most suitable for an elementary text
and is apparently the one most in favor among statisticians, it
must not be thought that other approaches are necessarily
invalid or even possibly less fruitful.²

¹ A brief review of the classical theory and the frequency theory of R. von
Mises is presented in the Appendix, pp. 242-251.

² The concept of probability presented in this book is patterned after that
presented by J. Neyman in his Lectures and Conferences on Mathematical
Statistics (Graduate School of the United States Department of Agriculture,
Washington, 1937). His views, Dr. Neyman believes, “are shared by E. S.
Pearson and other workers attached to the Department of Statistics at
University College, London.” He also refers to H. Cramér, Random
Variables and Probability Distributions (Cambridge, 1937); Maurice Fréchet,
Recherches théoriques modernes sur la théorie des probabilités (Gauthier-Villars,
Paris, 1937); A. Kolmogoroff, Grundbegriffe der Wahrscheinlichkeitsrechnung
(Julius Springer, Berlin, 1933); and D. J. Struik, “On the
Foundations of the Theory of Probabilities,” Philosophy of Science (1934),
Vol. 1, pp. 50-70.

Definition. A discussion of probability can best begin with a
finite set of objects. Suppose that, in a given set of t objects, m
possess a given property and t - m do not possess this property.
Then the probability of an object of this set having the given
property is m/t, or the relative frequency of these objects in the
set. The word “object” as used in this definition is to be
interpreted broadly. Besides objects proper, it may be taken
to include events that have the property of occurring or even
propositions that have the property of being true.

To illustrate the above definition of probability, consider an
ordinary deck of 52 playing cards. This will have 26 red cards
and 26 black cards; hence the probability of a red card in this
deck is 26/52 = 1/2. The deck also contains 13 cards of each suit,
so that the probability of a heart, say, is 13/52 = 1/4. This is also
the probability of a diamond, or a spade, or a club.

Description of Fundamental Probability Set. It should be
especially noted that in defining a probability the set of objects
to which it pertains must be precisely designated and the
property of an object to which the probability refers must be
carefully distinguished. For example, the probability of an ace in
a pinochle deck¹ is 8/48 = 1/6 and not 4/52 = 1/13, as it is in an ordinary
deck. Furthermore, for the same set of cards, the probability
of a card of a given color is not the same as the probability of a
card of a given suit or of a card of a given value. What is more
important is that each of these properties and hence their
probabilities pertain to a different classification of the objects of the
set. As will be seen later, it is possible to add probabilities
pertaining to the same classification of the objects of a given set,
but not probabilities pertaining to different classifications, even
though the set of objects is the same. A set of objects classified
in a given way is called a “fundamental probability set.” In
all calculations it is very important to define carefully the
fundamental probability set that is involved.

¹ A pinochle deck consists of 2 aces, 2 kings, 2 queens, 2 jacks, 2 tens, and
2 nines of each suit. There are no cards of lower value.

In this connection it should be noted that the “probability of
a heart in an ordinary deck of cards” is not necessarily the same
thing as the “probability of drawing a heart from the deck.”
For the former, the fundamental probability set is precisely
designated; it is simply the given deck of cards classified
according to suit. The total number in the deck may be readily
counted; the hearts may be easily separated from the others;
and their relative frequency, i.e., their probability, may be
directly computed. But what is the fundamental probability
set to which the “probability of drawing a heart from the deck”
pertains? To this there are several answers.
Suppose that 100 drawings are made from the given deck,
the card drawn each time being replaced in the deck and the
whole well shuffled before the next drawing. Let the number
of hearts so drawn be 20. Here the fundamental probability
set is the set of 100 drawings classified according to suit, and
the probability of a heart in this set is 20/100 = 1/5. In this case
also, the total number of objects can be counted and the
number having the given property can be readily ascertained.

The “probability of drawing a heart from the deck” may,
however, pertain to a set of 100 drawings to be made in the
future. Here the total number of “objects” in the set is given,
but there is no way of ascertaining how many of these drawings
will yield hearts. In this case, the “probability of drawing a
heart from the deck” is simply unknown.

Finally, the “probability of drawing a heart from the deck”
may pertain to a set of hypothetical drawings, not actual
drawings. If, it may be argued, 100 drawings should be made from
the deck in the prescribed manner and if 30 of these should be
hearts, then the probability of a heart in this assumed set would
be 30/100 = 3/10. The “probability of drawing a heart from the deck”
refers in this case to a hypothetical set.
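The distinction between the two fundamental probability sets can be made concrete with a sketch. The deck is one set, classified by suit; a particular run of 100 drawings with replacement is another. The seed and the helper names are arbitrary choices for illustration, and no particular count of hearts in the drawings is implied.

```python
import random
from fractions import Fraction

SUITS = ["hearts", "diamonds", "spades", "clubs"]

# An ordinary deck: 13 ranks in each of 4 suits.
deck = [(rank, suit) for suit in SUITS for rank in range(1, 14)]


def probability_of_heart(objects):
    """Relative frequency of hearts in a finite set of cards or drawings."""
    hearts = sum(1 for _, suit in objects if suit == "hearts")
    return Fraction(hearts, len(objects))


def draw_with_replacement(cards, number_of_drawings, seed):
    """Simulate drawings, replacing and reshuffling after each one."""
    rng = random.Random(seed)
    return [rng.choice(cards) for _ in range(number_of_drawings)]


drawings = draw_with_replacement(deck, 100, seed=1940)

print(probability_of_heart(deck))      # 1/4, the probability in the deck
print(probability_of_heart(drawings))  # relative frequency in these 100 drawings
```

The first figure is fixed by the composition of the deck; the second belongs only to the set of drawings actually made and will generally differ from one set of drawings to another.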

Infinite Probability Sets. Frequently, probability theory is
concerned with an infinite set of objects. These are usually
hypothetical sets but may in some cases be real sets, such as the
infinity of points on a line. Without going into mathematical
refinements, it may be said that the probability of an object
of a given property in an infinite set is the percentage of such
objects in the set. For the percentage of a particular kind of
object in an infinite set may be finite even if the number of
objects of the given property and the total number of objects
are infinite. For example, if a coin is tossed indefinitely,
both the number of tossings and the number of tossings
yielding heads may be increased without limit. Nevertheless,
the ratio of the number of heads to the total number of tossings
will stay within finite limits no matter how many tossings are
made. For an infinite set, therefore, as well as for a finite set,
the probability of an object having a particular property is the
relative frequency of such objects in the given set.¹

PROBABILITY AND THE RELATIVE FREQUENCY 
OF ACTUAL EVENTS 

In concluding this chapter a few words should be said about 
the relationship between mathematical probability and the 
relative frequency of actual events. As defined above, prob- 
ability is a constant characterizing a given set of objects; it is 
merely a mathematical abstraction. If the theory of prob- 
ability is to be of any practical use, however, it must be tied to 
the relative frequency of actual events. It must help, in other 
words, in making predictions about real life. 

The Law of Large Numbers. The link that ties mathematical
probability to the relative frequency of real events is actual
experience with mass phenomena. This experience has been
called the “law of large numbers,” which says that, when a large
number of random events is involved, it is usually possible to 
predict, with reasonable accuracy, the relative frequency of 
occurrence of a particular event by calculating a certain mathe- 
matical probability. To illustrate, consider once again an 
ordinary deck of playing cards. Mathematically, this can be 
looked upon as a set of 52 objects for which the probability of a 
heart is 13/52 = 1/4. Let a large number of drawings, say 1,000, be
made from this deck, the card drawn each time being replaced 
and the deck well shuffled before the next drawing. As already 
pointed out, no exact statement about the number of hearts
drawn can be made in advance of the drawings. Experience 
shows, however, that in random drawings of this kind the relative 
frequency of hearts drawn approximates fairly well the mathe- 
matical probability of a heart in the deck. Hence, in the given 
instance it may be predicted that of the 1,000 random drawings 
something close to 250 will be hearts.

¹ For a more refined definition of probability, see Neyman, op. cit., pp.
10-11.

The foregoing is a very simple illustration of the law of large
numbers. The law appears to be equally valid, however, for
more complicated calculations of probability. For example,
suppose there are two decks of cards, one an ordinary deck and 
the other a pinochle deck, and suppose that all possible combina- 
tions of two cards are made by combining one card from the 
ordinary deck with one card from the pinochle deck. Since the 
first card can be picked in 52 ways and the second in 48 ways,
there will be 52 X 48 = 2,496 such combinations. Of these
2,496 combinations, 4 X 8 = 32 will be pairs of aces; hence, in
this set of combinations, 32/2,496 = 1/78 is the probability of a
pair of aces.¹ Now let a very large number of drawings be made
from each deck of cards, the card drawn each time being replaced 
and the deck well shuffled before the next drawing. Further- 
more, let the first card drawn from the ordinary deck be paired 
with the first card drawn from the pinochle deck, the second 
card from the ordinary deck with the second card from the 
pinochle deck, etc. Then, if the number of random drawings 
is very large, experience shows that the pairs of aces actually 
occurring in this large set of drawings will be close to 1/78 times
the total number of drawings. Again the relative frequency of 
actual events can be approximately predicted by the computation 
of a mathematical probability. In fact, if random mass phe- 
nomena are involved, the whole of the calculus of probability can 
be employed in the prediction of relative frequencies with satis- 
factory accuracy. 
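Both halves of this illustration can be sketched in code: the mathematical probability is found by listing every pairing of an ordinary-deck card with a pinochle-deck card, and a large number of simulated paired drawings is then compared with it. The number of drawings, the seed, and all names are arbitrary choices for illustration; no particular simulated count is implied.

```python
import random
from fractions import Fraction
from itertools import product

SUITS = ["hearts", "diamonds", "spades", "clubs"]
ORDINARY_RANKS = ["A", "K", "Q", "J", "10", "9", "8", "7", "6", "5", "4", "3", "2"]
PINOCHLE_RANKS = ["A", "K", "Q", "J", "10", "9"]

ordinary = [(rank, suit) for suit in SUITS for rank in ORDINARY_RANKS]
# A pinochle deck holds two copies of each card from nine to ace.
pinochle = [(rank, suit) for suit in SUITS for rank in PINOCHLE_RANKS
            for _ in range(2)]

# The fundamental probability set: every pairing of one card from each deck.
pairs = list(product(ordinary, pinochle))
ace_pairs = sum(1 for first, second in pairs
                if first[0] == "A" and second[0] == "A")
probability = Fraction(ace_pairs, len(pairs))
print(len(pairs), ace_pairs, probability)  # 2496 32 1/78

# Paired random drawings with replacement from the two decks.
rng = random.Random(8)
number_of_drawings = 100_000
observed = sum(
    1 for _ in range(number_of_drawings)
    if rng.choice(ordinary)[0] == "A" and rng.choice(pinochle)[0] == "A"
)
print(observed / number_of_drawings, float(probability))
```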

Empirically Determined Probabilities, It might be pointed 
out in passing that in many instances the original set of objects 
is not completely known and the probability of a given property 
of the set must be determined empirically. For example, the 
total number of deaths in the United States of white males, age 
fifty, is not completely known. Indeed, so far as we know, 
deaths of men, age fifty, will continue to occur indefinitely. Thus 
of the total number of men who have reached and will reach the 
age of fifty, the number who have died or will die during their 
fiftieth year is not precisely known. On the basis of the law of
large numbers, however, it seems safe to assume that the many 
vital statistics that have been accumulated give a very close 
approximation to the true probability of death at age fifty. That 
¹ See Chap. X.

The quality of being a 2, 3, 4, 5, or 6 is the attribute of a face. 
Since a face of a die cannot be both a 2 and a 6, say, these 
attributes are mutually exclusive, and since a face must have one 
of the markings listed above, the total probability is again 1. 

Discrete Probability Distributions. When the attributes of a 
set are qualitative in character, such as spades or hearts in the 
case of a deck of cards or heads or tails in the case of coins, or 
when they are represented by a set of numerical values that do 
not vary continuously, such as the number of spots on the face 
of a die, the distribution of probability is said to be "discrete." 
If the attributes are represented by points on a horizontal axis 
and their probabilities measured along a vertical axis, a discrete 
probability distribution may be pictured by a series of lines or 
bars as in Figs. 77 and 78. It will be noted that it is the height 
of the bar in each case that measures the probability of the 
attribute at which it is erected. 

Fig. 77. Distribution of probability of heads and tails on a coin. 
(Horizontal axis: heads, tails.) 

Fig. 78. Distribution of probability of the spots on the faces of 
a die. (Horizontal axis: spots on face of die, 1 to 6.) 
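
A discrete distribution of this kind is easily written down as a table of attributes and probabilities. The Python sketch below is an added illustration (the names `coin`, `die`, and `bar_chart` are hypothetical); it assumes a fair coin and a fair die, checks that the total probability is 1, and prints bars whose lengths are proportional to the probabilities.

```python
from fractions import Fraction

# Assumed fair coin and fair die: every attribute equally probable.
coin = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}
die = {spots: Fraction(1, 6) for spots in range(1, 7)}


def total_probability(distribution):
    """Sum of the probabilities of all attributes of the set."""
    return sum(distribution.values())


def bar_chart(distribution, width=30):
    """One text bar per attribute; bar length is proportional to probability."""
    lines = []
    for attribute, p in distribution.items():
        bar = "#" * round(float(p) * width)
        lines.append(f"{attribute!s:>6} | {bar} {p}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(bar_chart(coin))
    print(bar_chart(die))
    print(total_probability(coin), total_probability(die))
```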

Continuous Probability Distributions. If the members of a 
set consist of the numerical figures obtained by the repeated 
measurement of the length of a given table or the continued 
measurement of the heights of adult white males, living now and 
in the future, the attributes assumed by the members may form a 
continuous variable. In such a case, the total probability of 
1 can be considered to be distributed over the whole range 
of variation; it will thus form a "continuous" distribution of 
probability. More exactly, the range may be divided into small 
class intervals, and location within one of these intervals may be 
taken as the attribute of a member of the set. In this instance, 
the probability of a member belonging to a given interval may be 
represented by the area of a rectangle erected over that interval, 
and the total distribution may be pictured as a set of such 
rectangles in the manner shown in Fig. 79. If, now, the class 
intervals are made smaller and smaller, the tops of these rectangles 
will tend to sketch out a smooth curve (cf. Fig. 80). A probability 
curve of this sort can be looked upon as the limit approached as 
the class intervals into which the range is divided are made 
infinitesimally small. 
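
The passage from rectangles to a smooth curve can be followed numerically. The Python sketch below is an added illustration under stated assumptions: heights are taken to follow a normal curve with a mean of 70 inches and a standard deviation of 2 inches (values chosen only to resemble Fig. 79, not taken from the text), and the helper names are hypothetical. Each rectangle is given an area equal to the probability of its class interval; as the interval narrows, the rectangle heights approach the ordinates of the curve.

```python
import math

MEAN, SD = 70.0, 2.0  # assumed values, for illustration only


def cdf(x):
    """Probability of a height below x under the assumed normal curve."""
    return 0.5 * (1.0 + math.erf((x - MEAN) / (SD * math.sqrt(2.0))))


def density(x):
    """Ordinate of the assumed smooth probability curve at x."""
    z = (x - MEAN) / SD
    return math.exp(-0.5 * z * z) / (SD * math.sqrt(2.0 * math.pi))


def rectangles(lo, hi, width):
    """(left edge, height) pairs; area of each rectangle = interval probability."""
    count = round((hi - lo) / width)
    result = []
    for i in range(count):
        left = lo + i * width
        probability = cdf(left + width) - cdf(left)
        result.append((left, probability / width))
    return result


def max_gap(width, lo=60.0, hi=80.0):
    """Largest gap between a rectangle top and the curve at the interval midpoint."""
    return max(
        abs(height - density(left + width / 2.0))
        for left, height in rectangles(lo, hi, width)
    )


if __name__ == "__main__":
    for width in (2.0, 1.0, 0.5, 0.1):
        area = sum(h * width for _, h in rectangles(60.0, 80.0, width))
        print(f"interval {width:>4}: total area {area:.6f}, max gap {max_gap(width):.6f}")
```

The total area stays at (very nearly) 1 for every choice of interval, while the gap between the rectangle tops and the curve shrinks as the interval is reduced.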

Frequency Distributions as Probability Distributions. From the 
definition of probability given above, it follows that any 
frequency distribution in which the frequencies are expressed as a 
percentage of the total number of cases is a distribution of 
probability of the given set of cases. It likewise follows that a 
frequency curve that represents the distribution of relative 
frequency of an infinite population of cases¹ is also a 
probability curve. 

Fig. 79. A continuous distribution of probability. (Horizontal 
axis: inches, 64 to 76.) 

Since a distribution of relative frequency and a distribution of 
probability are thus one and the same thing, all the measures of 
the various characteristics of frequency distributions 
automatically apply to probability distributions. Thus a probability 
distribution has a mean, a standard deviation, a coefficient of 
skewness, and a coefficient of kurtosis, like any frequency 
distribution. 
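
As a small worked illustration (added here, not in the original), the Python sketch below turns a frequency distribution into a probability distribution and computes the four measures just named. The frequencies are hypothetical, and the skewness and kurtosis are computed from moments as m3 / m2^1.5 and m4 / m2^2, which is an assumption about the coefficients intended.

```python
import math

# Hypothetical frequency distribution: class midpoints (inches) and counts.
frequencies = {64: 2, 66: 9, 68: 22, 70: 34, 72: 22, 74: 9, 76: 2}


def to_probabilities(freqs):
    """Express each frequency as a fraction of the total number of cases."""
    total = sum(freqs.values())
    return {x: f / total for x, f in freqs.items()}


def moment_measures(distribution):
    """Mean, standard deviation, and moment-based skewness and kurtosis."""
    mean = sum(x * p for x, p in distribution.items())
    m2 = sum((x - mean) ** 2 * p for x, p in distribution.items())
    m3 = sum((x - mean) ** 3 * p for x, p in distribution.items())
    m4 = sum((x - mean) ** 4 * p for x, p in distribution.items())
    return {
        "mean": mean,
        "sd": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


if __name__ == "__main__":
    probabilities = to_probabilities(frequencies)
    print(probabilities)
    print(moment_measures(probabilities))
```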

¹ See p. 238. 

ALGEBRAIC AND GRAPHIC REPRESENTATION OF THE NORMAL 
FREQUENCY CURVE 

Functional Relationships. Before entering upon a mathematical 
and graphic representation of frequency and probability 
curves, it may be well to review briefly the algebraic and geo- 
metric description of simple functional relationships. This is 
the purpose of the present section. 

If one quantity varies when a second varies, the first is said to 
be a function of the second. The pressure of air in an auto- 
mobile tire, for example, varies with the temperature; pressure is 
thus a function of temperature. Again, the quantity of butter 
bought varies with its price; hence, the purchase of butter is a 
function of its price. 

Functional relationships of this kind are described symbolically 
by such expressions as \(y = f(x)\), \(y = F(x)\), \(y = G(x)\), 
\(y = \varphi(x)\), and \(y = \psi(x)\), all of which are to be 
read "y is a function of x" or, more specifically, "y is the f 
function of x," "y is the F function of x," etc. The expressions 
\(y = f(x)\) and \(y = F(x)\) are the most common; the others are 
often used, however, when a problem involves more than one 
functional relationship. For example, if y and z are both 
functions of x, this may be expressed by \(y = f(x)\) and 
\(z = g(x)\). 

Frequently a quantity varies, not merely with one, but with a 
number of other quantities. The former may then be said to be 
a joint function of the latter. Thus the volume of gas in a 
tube is a function of the pressure and the temperature (the gas 
laws of Boyle and Charles); the quantity of butter bought is a 
function of the price of butter and the income of its purchasers. 
Joint functional relationships of this kind are expressed by 
\(y = f(x, z)\), \(y = F(x, z)\), \(y = \varphi(x, z)\), etc., all 
to be read "y is a function of x and z." 

Explicit and Implicit Functions. The functional relationships 
so far considered are "explicit" functions. In explicit functions 
one variable is selected as the dependent variable, and the other 
or others as the independent variables; this is indicated by 
writing the dependent variable to the left of the equal sign. Often, 
however, it is convenient to talk of two variables as being 
functionally related without indicating which is to be taken as 
dependent and which as independent variable. Such a functional 
relationship is indicated by \(f(x, y) = 0\), \(F(x, y) = 0\), 
\(\varphi(x, y) = 0\), etc., or if there are more than two 
variables, by \(f(x, y, z) = 0\), \(F(x, y, z) = 0\), 
\(\varphi(x, y, z) = 0\), etc. These all mean that x and y or 
x, y, and z are "functionally related." Functions of this kind are 
called "implicit" functions. An explicit function can often 
(although not always) be derived from an implicit function by 
merely solving the latter for the variable selected as dependent. 

The simplest type of functional relationship is expressed by a 
polynomial. A polynomial in x means such expressions as \(x\) 
(strictly speaking a monomial), \(a + x\), \(a + bx\), and 
\(a + bx + cx^2\). The "degree" of the polynomial is the highest 
power of x that occurs in the expression. Thus \(a + bx\) is a 
polynomial of the first degree; \(a + bx + cx^2\) and 
\(a + cx^2\) are polynomials of the second degree; and 
\(a + bx + cx^2 + dx^3\), \(a + bx + dx^3\), \(a + cx^2 + dx^3\), 
and \(a + dx^3\) are polynomials of the third degree. 
Polynomials in two variables, say x and z, are illustrated by 
\(a + bx + gz\), \(a + bx + cx^2 + gz + hz^2\), 
\(a + bx + gz + mxz\), and \(a + bx + kz^3\), the first being of 
the first degree, the second and third of the second degree, and 
the last of the third degree. 

If y varies by a constant absolute amount every time x varies 
by a fixed given amount, the function that expresses this 
relationship is a first-degree polynomial in x, such as 
\(y = a + bx\). Every time x increases (or decreases) by one unit, 
y increases (or decreases) by b units. Thus, if \(y = 10 + 2x\), 
y increases (or decreases) by two units every time x increases (or 
decreases) by one unit. If b is negative, then the variation in y 
is opposite in direction to that of x. For example, if 
\(y = 10 - 2x\), then y decreases (or increases) by two units every 
time x increases (or decreases) by one unit. The quantity a is the 
value of y when x equals 0; if \(y = 0\) when \(x = 0\), then a 
must be zero. 

When a functional relationship can be represented in this way 
by a first-degree polynomial, such as \(y = a + bx\), then y is 
said to be a "linear" function of x. This continues to hold true 
when there is more than one independent variable. Thus 
\(y = a + bx + gz\) is said to express y as a linear function of 
x and z. If the change in y that accompanies a given change in x 
varies with the value of x, then the functional relationship can 
no longer be expressed by a first-degree polynomial in x. The 
function is in this case "nonlinear," and a higher degree 
polynomial or some more complex expression must be employed to 
indicate the relationship. A nonlinear relationship between y 
and two or more variables, such as x and z, must also be expressed 
by some function other than a first-degree polynomial in these 
two variables. 
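
The distinction between linear and nonlinear functions can be seen by tabulating the change in y for successive unit changes in x. The Python sketch below is an added illustration; the two straight lines are those of the text, while the second-degree polynomial and the helper name `first_differences` are hypothetical choices made for the example.

```python
def first_differences(f, xs):
    """Change in f between successive values of x."""
    ys = [f(x) for x in xs]
    return [later - earlier for earlier, later in zip(ys, ys[1:])]


def rising_line(x):
    return 10 + 2 * x          # y = a + bx with a = 10, b = 2


def falling_line(x):
    return 10 - 2 * x          # same intercept, b negative


def parabola(x):
    return 10 + 2 * x + x * x  # second degree, hence nonlinear (hypothetical)


if __name__ == "__main__":
    xs = range(0, 6)
    print(first_differences(rising_line, xs))   # constant change of +2
    print(first_differences(falling_line, xs))  # constant change of -2
    print(first_differences(parabola, xs))      # change varies with x
```

For the two straight lines the change in y is the same at every step; for the second-degree polynomial it depends on the value of x, which is what marks the relationship as nonlinear.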

Graphs of Simple Functions. A pair of values, x and y, may 
be represented by a point in a plane. The point P in Fig. 81, 
for example, represents the value \(x = x_0\) and \(y = y_0\); 
i.e., the coordinates of P are \((x_0, y_0)\). 

Fig. 81. Plotting of a point. 

Fig. 82. Graph of a straight line. 

First-degree Polynomials. The graph of a first-degree 
polynomial is a straight line (hence the name "linear" 
relationship), and conversely every straight line may be represented 
algebraically by a polynomial of the first degree. The simplest 
way to comprehend this relationship between the algebraic and the 
geometric presentation is to think of a straight line in reference 
to (1) the angle it makes with the x-axis and (2) the intercept 
it cuts on the y-axis. Thus, in Fig. 82, let \(\theta\) represent 
the angle CAB that the line makes with the x-axis at A. The 
tangent of this angle is the slope of the line AB. Let b represent 
this slope; then \(b = \tan\theta\). It is evident that a straight 
line is determined when its slope b and its y intercept, a (or OB), 
are found, as follows: Let P (whose coordinates are any x, y) be a 
representative point of the line. Take PC, the perpendicular to 
Ox, from P, and BD, the perpendicular to PC, from B. Then, 

\[ \tan DBP = \frac{DP}{BD} \]

or 

\[ DP = \tan DBP \times BD \tag{1} \]

But 

\[ DP = CP - CD = CP - OB = y - a \]

Also, 

\[ BD = OC = x \quad\text{and}\quad \tan DBP = \tan CAP \quad\text{(similar triangles)} \]

and 

\[ \tan CAP = \tan\theta = b \]

Now, since \(DP = y - a\), and \(\tan DBP = \tan\theta = b\), and 
\(BD = x\), then, upon substituting in Eq. (1), 

\[ y - a = bx \quad\text{or}\quad y = a + bx \tag{2} \]

A straight line thus represents y as equal to a polynomial in x 
of the first degree. When the line passes through the origin, 
\(a = 0\) and the functional relationship becomes simply 
\(y = bx\). 

Second-degree Polynomials. If the functional relationship 
between x and y takes the form of a second-degree polynomial, 
such as \(y = a + bx + cx^2\), its graph will be a parabola. By 
definition, a parabola is the locus of all points equidistant from 
a fixed point called the "focus" and a fixed line called the 
"directrix." In Fig. 83, the focus is the point (F), and the 
directrix is the line KS (\(y + F = 0\)). Take any point on the 
parabola, \(P(x, y)\), and draw from it a line perpendicular to KS, 
at point M; then draw a line from the focus (F) to the point P. 

Fig. 83. Graph of a parabola. 

At the point \(x = 0\), \(y = 0\), it is obvious that the parabola 
is equidistant from the directrix \(y + F = 0\), or \(y = -F\), and 
the focus at \(x = 0\), \(y = F\). 

Now, if a line were drawn from P perpendicular to the y-axis, 
at C, it is clear that \(FP^2 = (y - F)^2 + x^2\) since 
\(FC = y - F\) and \(PC = x\). This is true since the square of the 
hypotenuse is equal to the sum of the squares of the other two 
sides of a right-angled triangle. Furthermore, it can be seen from 
Fig. 83 that \(MP^2 = (y + F)^2\) because MP is \(y + F\). By 
hypothesis, \(MP^2 = FP^2\), and hence, by substitution, 

\[ (y + F)^2 = (y - F)^2 + x^2 \]

which by transposition and simplification becomes 

\[ x^2 = 4Fy \quad\text{or}\quad y = \frac{x^2}{4F} \]

This is of the general form of \(y = cx^2\). If the curve had not 
passed through the origin, its equation would have been of the 
form \(y = a + cx^2\), and if its vertex had come either to the 
right or left of the y-axis, its equation would have been 

\[ y = a + bx + cx^2 \tag{3} \]

If the value of c is negative, the curve turns down as in Fig. 84 
instead of up as in Fig. 83. These parabolic curves thus illus- 
trate the form of a functional relationship in which y is set equal 
to a second-degree polynomial in x. 
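
The defining property of the parabola can be checked numerically. The Python sketch below is an added illustration with hypothetical function names: for a focus at (0, F) and directrix y = -F, it takes points on the curve y = x^2/(4F) and compares the distance to the focus with the distance to the directrix.

```python
import math


def parabola_y(x, focus):
    """Ordinate of the parabola with focus (0, focus) and directrix y = -focus."""
    return x * x / (4.0 * focus)


def distance_to_focus(x, y, focus):
    """Length FP, by the right-triangle property."""
    return math.hypot(x, y - focus)


def distance_to_directrix(y, focus):
    """Length MP, the perpendicular distance from P to the line y = -focus."""
    return abs(y + focus)


if __name__ == "__main__":
    F = 1.5  # arbitrary focal distance chosen for the illustration
    for x in (-3.0, -1.0, 0.0, 0.5, 2.0, 10.0):
        y = parabola_y(x, F)
        print(x, y, distance_to_focus(x, y, F), distance_to_directrix(y, F))
```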



Fig. 84. Graph of a parabola, \(y = -cx^2\) (\(c > 0\)). 

Fig. 85. Graph of a circle, \(x^2 + y^2 = r^2\). 

The Circle. The implicit functional relationship 

\[ x^2 + y^2 - r^2 = 0 \tag{4} \]

is of interest in that its graph is a circle. By definition, a circle 
is the locus of all points equidistant from a fixed point called the 
"center." In Fig. 85, the center is taken at the origin. By the 
property of right-angle triangles, the square of the distance of 
any point P from the center at O (the origin) is simply 
\(x^2 + y^2\). Since by definition this must be the same for all 
points on the circle, the equation of the circle is simply 
\(x^2 + y^2 = r^2\), where r is the radius. By transposition, this 
becomes \(x^2 + y^2 - r^2 = 0\). If the center of the circle had 
been at the point (a, b) instead of at the origin, its equation 
would have been \((x - a)^2 + (y - b)^2 - r^2 = 0\). 


An implicit functional relationship, therefore, in which two 
variables enter to the second degree with identical coefficients 
but in which there is no cross-product term (such as xy) is simply 
the algebraic expression for a circle. 

The Ellipse. If the functional relationship represented by a 
circle is modified so that the coefficients of the second-degree 
terms are no longer identical (but still retain the same signs), its 
geometric counterpart is distorted so as to become an ellipse. 
Thus the graph of the implicit functional relationship 

\[ ax^2 + by^2 - r^2 = 0 \tag{5} \]

is an ellipse whose semimajor axis is \(r/\sqrt{a}\) and whose 
semiminor axis is \(r/\sqrt{b}\) (cf. Fig. 86). If b is less than 
a, then the ellipse runs the other way, as in Fig. 87. 
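
The two semi-axes and the points of the ellipse can be computed directly from Eq. (5). The Python sketch below is an added illustration with hypothetical function names; it solves the implicit relationship for y at a given x, returning both roots, and the circle appears as the special case in which a and b are equal.

```python
import math


def semi_axes(a, b, r):
    """Intercepts on the x-axis and y-axis of a*x^2 + b*y^2 - r^2 = 0."""
    return r / math.sqrt(a), r / math.sqrt(b)


def ellipse_y(x, a, b, r):
    """Both values of y satisfying a*x^2 + b*y^2 - r^2 = 0, or None if x lies outside."""
    inside = (r * r - a * x * x) / b
    if inside < 0:
        return None
    root = math.sqrt(inside)
    return (root, -root)


if __name__ == "__main__":
    a, b, r = 1.0, 4.0, 2.0  # arbitrary coefficients for the illustration
    print(semi_axes(a, b, r))
    for x in (0.0, 1.0, 2.0, 3.0):
        print(x, ellipse_y(x, a, b, r))
```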



It will be noted that, if either of the implicit relationships 
\(x^2 + y^2 - r^2 = 0\) or \(ax^2 + by^2 - r^2 = 0\) is solved for 
y, the resulting solution gives y, not as a single-valued function 
of x, but as a double-valued function. Thus, in the case of the 
circle, \(y = \pm\sqrt{r^2 - x^2}\), which shows that, for each 
value of x, there are two values of y. Likewise, for the ellipse, 
\(y = \pm\sqrt{(r^2 - ax^2)/b}\), which again gives two values of y 
for each value of x. Geometrically this means that a line 
perpendicular to the x-axis cuts the circle and ellipse at two 
different points. In contrast, the straight line and the parabola 
express y as a single-valued function of x. The difference is due 
to the fact that, in the case of the circle and ellipse, it is 
\(y^2\) and not y that is expressed as a polynomial in x and when 
the square root is taken two values of 

Rising Exponential Function. A simple function that does not 
express y or \(y^2\) as a polynomial in x is the exponential, or 
"compound-interest," function. Suppose that a sum of $1 is put 
out at interest at 6 per cent per year compounded for 3 years; 
then the value of this sum at the end of the 3 years would be 
as follows: 

\[ \text{Value at end of 3 years} = (1 + 0.06)(1 + 0.06)(1 + 0.06) = (1 + 0.06)^3 \]

In general, if $1 is put out at interest for x years and compounded 
at the rate of r per annum, its value at the end of the x years 
would be as follows: 

\[ \text{Value at end of } x \text{ years} = (1 + r)^x \]

If the interest is compounded every 6 months instead of every 
year, then the value at the end of x years would be as follows: 

\[ \text{Value at end of } x \text{ years} = \left(1 + \frac{r}{2}\right)^{2x} \]

for the interest rate for 6 months is just half the rate for a year 
and the number of 6-month periods is just double the number of 
years. If the interest is compounded every quarter, then the 
value of the $1 at the end of x years would be 
\(\left(1 + \frac{r}{4}\right)^{4x}\); and if the interest is 
compounded every 1/nth of a year, the value at the end of x years 
would be \(\left(1 + \frac{r}{n}\right)^{nx}\), r always being the 
rate of simple interest per year. This last may be written 

\[ \left[\left(1 + \frac{1}{n/r}\right)^{n/r}\right]^{rx} \]

If, now, n is made infinitely large, in other words, if the period of 
compounding (that is, 1/nth of a year) is made infinitesimally 
small, so that the operation of compounding may be viewed as 
practically continuous, then the quantity 
\(\left(1 + \frac{1}{n/r}\right)^{n/r}\) is equal approximately to 
e, the base of the Napierian system of logarithms. For, by 
definition, e is the limit of \(\left(1 + \frac{1}{m}\right)^m\) as 
m approaches infinity. The value of $1 at the end of x years when 
interest at r per annum is compounded continuously is therefore 

\[ e^{rx} \]

The quantity e, it will be recalled, is a numerical constant 
equal approximately* to 2.7183, so that the value of a sum 
compounded continuously for x years is given by a function of 
the form \(y = e^{bx}\), where e and b are merely constants. The 
compound-interest function is thus an "exponential" function, 
for the variable x enters into the function as an exponent. A 
graph of this function for a positive value of b (= r) is shown in 
Fig. 88. 
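
The approach of periodic compounding to continuous compounding can be followed numerically. The Python sketch below is an added illustration; the function names are hypothetical, and the 6 per cent rate and 3-year term are those of the example in the text.

```python
import math


def compound_value(r, x, n):
    """Value of $1 after x years at annual rate r, compounded n times a year."""
    return (1.0 + r / n) ** (n * x)


def continuous_value(r, x):
    """Value of $1 after x years at annual rate r, compounded continuously."""
    return math.exp(r * x)


def e_estimate(m):
    """The quantity (1 + 1/m)^m, whose limit as m grows is e."""
    return (1.0 + 1.0 / m) ** m


if __name__ == "__main__":
    r, x = 0.06, 3
    for n in (1, 2, 4, 12, 365, 1_000_000):
        print(f"compounded {n:>9} times a year: {compound_value(r, x, n):.6f}")
    print(f"compounded continuously:        {continuous_value(r, x):.6f}")
    for m in (10, 50, 1_000, 10_000):
        print(m, e_estimate(m))
```

The more frequent the compounding, the larger the accumulated value, but the increase levels off at the continuous value e^(rx).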

* This may be easily proved by putting increasingly large values 
of m in the formula \(\left(1 + \frac{1}{m}\right)^m\). Thus, 

for m = 10, \(e \approx (1 + 1/10)^{10} = (1.1)^{10} = 2.594\); 
for m = 50, \(e \approx (1 + 1/50)^{50} = (1.02)^{50} = 2.691\); 
for m = 1,000, \(e \approx (1 + 1/1{,}000)^{1{,}000} = (1.001)^{1{,}000} = 2.7171\); 
for m = 10,000, \(e \approx (1 + 1/10{,}000)^{10{,}000} = (1.0001)^{10{,}000} = 2.7182\); 

the computations being made by logarithms (see Appendix IV). 

Declining Exponential Function. The "present value" of $1 
due x years hence, interest being compounded continuously at 
the rate of r per annum, would be equal to \(e^{-rx}\). For since 
the sum of $1 would accumulate to \(e^{rx}\) dollars at the end of 
x years, the sum of \(e^{-rx}\) (\(= 1/e^{rx}\)) dollars would 
accumulate to 

\[ e^{-rx} \cdot e^{rx} = e^{0} = 1 \text{ dollar} \]

at the end of that time, and hence the present value of $1 due x 
years hence is \(e^{-rx}\). This "discount" function, as it may be 
called, is thus a negative exponential function. Its graph is 
shown in Fig. 89. 
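
The relation between the discount function and the compound-interest function may be confirmed with a few lines of Python. This sketch is an added illustration; the function names and the 6 per cent rate are assumptions made for the example.

```python
import math


def present_value(r, x):
    """Present value of $1 due x years hence, discounted continuously at rate r."""
    return math.exp(-r * x)


def accumulate(amount, r, x):
    """Value of amount after x years of continuous compounding at rate r."""
    return amount * math.exp(r * x)


if __name__ == "__main__":
    r = 0.06
    for years in (0, 1, 5, 10, 25):
        pv = present_value(r, years)
        print(years, round(pv, 4), round(accumulate(pv, r, years), 4))
```

The present value falls steadily as the due date recedes, and in every case the discounted sum accumulates back to exactly $1.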

Whereas the exponential functions are not themselves 
polynomials, it will be noted that the expression that constitutes 
the exponent of e is a first-degree polynomial in x (more 
accurately, the monomial x). This suggests that interesting 
modifications of the exponential function might be obtained by 
making the exponent of e a second or higher degree polynomial 
in x. A function of this kind that is of very great importance in 
the theory of frequency curves is \(y = e^{-cx^2}\). Figure 90 
shows how well this function represents the general shape of a 
frequency distribution; as will be seen shortly, it is the kernel of 
the formula for the normal frequency curve. 

Fig. 90. The bell-shaped (normal) curve, \(y = e^{-cx^2}\) 
(\(c > 0\)). 

Formula for the Normal Frequency and Probability Curves. 
Formulas for frequency and probability curves may be written 
in two ways. The first, which may be symbolized by 
\(y = \varphi(X)\), merely expresses the ordinate of the curve y as 
a function of the attribute X whose frequency is being measured. 
(\(\varphi\) is used here to signify "function of," instead of the 
usual F or f, in order to avoid confusion with the employment of 
the latter to indicate frequencies.) In this form the formula 
simply describes the locus of the points that constitute the 
frequency or probability curve. 

The second method of writing a frequency or probability 
formula is \(d(F/N)\) or \(dP = \varphi(X)\,dX\). In this form, the 
probability or relative frequency of a case lying between X and 
X + dX is expressed as a function of the attribute X. It will 
be recalled that a probability or frequency curve is the limit 
approached by an area histogram as the class interval is made 
infinitesimally small. The expression \(d(F/N)\) or 
\(dP = \varphi(X)\,dX\) merely says that, when the class interval 
(of size dX)* is made infinitesimally small, the area under the 
curve for any class interval [that is, \(d(F/N)\) or \(dP\)] is 
approximately equal to the area of a small rectangle whose base is 
dX and whose height is the ordinate of the curve \(\varphi(X)\) at 
X (cf. Fig. 91). This second method is, strictly speaking, the 
proper method for describing a "probability" or "frequency" 
curve, since it is this and not the first which actually expresses 
the probability or frequency of a given class interval as a 
function of the attribute X. The former gives merely an algebraic 
expression for the curve and not the area under the curve.† 

* The letters dX, dP, d(F/N), etc., are to be read as a single 
symbol and not as the product d and X or d and P, etc. The symbol 
dX means an infinitesimally small part of the X range, the symbol 
dP means an infinitesimal probability, and the symbol d(F/N) means 
an infinitesimal relative frequency. 

† The first method of description is, however, the only form that 
is appropriate for a discrete distribution. 

Fig. 91. Graphical representation of a probability function. (The 
area under the curve \(y = \varphi(X)\) over the interval dX, that 
is, dP or d(F/N), is given approximately by the area of the 
rectangle erected on dX.) 

The Normal Curve. One curve occurs very often in statistical 
analysis, especially in the theory of sampling. This is the normal 
frequency curve. Its algebraic and graphic representation may 
profitably be illustrated by a brief description. 

The mathematical formula for the normal curve is 

\[ dP = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X - \bar{X})^2}{2\sigma^2}}\, dX \tag{6} \]

where e (= 2.7183+) is the base of the Napierian system of 
logarithms, \(\bar{X}\) is the mean of the distribution, and 
\(\sigma\) is its standard deviation. Pictures of the curve are 
given in Figs. 92a and 92b. It will be noted that the curve is 
symmetrical and generally bell-shaped. 

Owing to its symmetry, the center of the curve comes at the 
mean point \(X = \bar{X}\). Here the curve also reaches its 
greatest height, \(1/(\sigma\sqrt{2\pi})\), and from there it 
slopes gradually downward on each side as the factor 
\(e^{-(X - \bar{X})^2/2\sigma^2}\) assumes greater and greater 
significance.¹ The points of inflection, i.e., the points at which 
the left and right branches change from concave to convex 
downward, fall at \(X = \bar{X} \pm \sigma\). About two-thirds of 
the area under the curve lies between these two points of 
inflection, and about 95 per cent between \(\bar{X} - 2\sigma\) and 
\(\bar{X} + 2\sigma\). The average deviation equals 0.7979 times 
the standard deviation, and the fourth moment equals three times 
the square of the second moment (hence \(\beta_2 = 3\)). 
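
The properties just listed can be verified by adding up small rectangles under the curve. The Python sketch below is an added illustration: it uses a simple midpoint rule as a stand-in for the integral, the function names are hypothetical, and the mean of 4 and standard deviation of 1.5 are arbitrary choices.

```python
import math


def normal_density(x, mean, sd):
    """Ordinate of the normal curve of Eq. (6) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule area under f between lo and hi (sum of thin rectangles)."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


def summary(mean, sd):
    """Areas and moment ratios of the normal curve, found numerically."""
    def pdf(x):
        return normal_density(x, mean, sd)

    # Ten standard deviations either side stands in for the whole range.
    lo, hi = mean - 10.0 * sd, mean + 10.0 * sd
    m2 = integrate(lambda x: (x - mean) ** 2 * pdf(x), lo, hi)
    m4 = integrate(lambda x: (x - mean) ** 4 * pdf(x), lo, hi)
    mean_dev = integrate(lambda x: abs(x - mean) * pdf(x), lo, hi)
    return {
        "within_1_sd": integrate(pdf, mean - sd, mean + sd),
        "within_2_sd": integrate(pdf, mean - 2.0 * sd, mean + 2.0 * sd),
        "mean_deviation_over_sd": mean_dev / sd,
        "beta_2": m4 / m2 ** 2,
    }


if __name__ == "__main__":
    for name, value in summary(4.0, 1.5).items():
        print(f"{name}: {value:.4f}")
```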



Fig. 92a. Graphs of normal frequency curves with different means 
but same standard deviations. 


As will be noted from the formula, a particular normal curve 
is determined when its mean \(\bar{X}\) and its standard deviation 
\(\sigma\) are given. Different normal curves, then, will have 
either different means or different standard deviations, or both. 
If the means are different, the curves will have different 
positions on the X-axis; if the standard deviations are different, 
the curves will be of different widths. This is illustrated in 
Figs. 92a and 92b. 

Although normal curves may thus differ in respect to their 
means and standard deviations, they all possess the essential 
"normal" form. This may be brought out by measuring the 
attribute X, not in original units, such as pounds or dollars, but 
as a deviation from the mean of X measured in standard-deviation 
units, i.e., as \((X - \bar{X})/\sigma = x/\sigma\). Whenever this 
is done, all normal curves become one and the same curve. 

¹ It will be noted that 
\(e^{-(X - \bar{X})^2/2\sigma^2} = 1/e^{(X - \bar{X})^2/2\sigma^2}\). 
When \(X = \bar{X}\), this becomes \(1/e^0 = 1/1 = 1\). As X moves 
away from \(\bar{X}\) in either direction, 
\(e^{(X - \bar{X})^2/2\sigma^2}\) becomes larger and larger and 
hence \(1/e^{(X - \bar{X})^2/2\sigma^2}\) becomes smaller and 
smaller. All the ordinates are positive since the exponent of e is 
squared. 


Fig. 92b. Graphs of normal frequency curves with same mean, but 
different standard deviations. 

Suppose, for example, that X in one case represents pounds 
and in another quarts. In both cases the probability or relative 
frequency is "normally" distributed, but the mean of the first 
distribution is 4 pounds and its standard deviation is 1.5 pounds. 
This distribution is represented by curve A in Fig. 92a. The 
second distribution has a mean of 6 quarts and a standard 
deviation of 0.5 quart; it is represented by curve D in Fig. 92b. 
If, however, the unit adopted for the measurement of the 
attribute of an element is in the first case 
\((X\ \text{lb.} - 4\ \text{lb.})/1.5\ \text{lb.}\) and in the 
second case \((X\ \text{qt.} - 6\ \text{qt.})/0.5\ \text{qt.}\), 
then in terms of these \((X - \bar{X})/\sigma\) or \(x/\sigma\) 
units the two distributions will be identical. It is thus possible 
to reduce all normal distributions to a standard normal form¹ 
(see Fig. 93). 
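
That the two distributions coincide in standard units can be checked directly. The Python sketch below is an added illustration with hypothetical function names; it assumes the pound distribution has mean 4 and standard deviation 1.5, and the quart distribution mean 6 and standard deviation 0.5. Heights are multiplied by the standard deviation so that the area over each interval is unchanged when the unit of measurement is changed.

```python
import math


def normal_density(x, mean, sd):
    """Ordinate of the normal curve in original units."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def standardized_height(t, mean, sd):
    """Ordinate at t standard-deviation units from the mean.

    The base of each rectangle shrinks by a factor of sd when measured in
    standard units, so the height is multiplied by sd to keep the area dP.
    """
    x = mean + t * sd
    return sd * normal_density(x, mean, sd)


if __name__ == "__main__":
    for t in (-2.0, -1.0, 0.0, 0.5, 3.0):
        pounds = standardized_height(t, 4.0, 1.5)
        quarts = standardized_height(t, 6.0, 0.5)
        print(t, round(pounds, 4), round(quarts, 4))
```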

Consider now the relationship between the mathematical 
formula for the curve and this standard normal form. In the 
individual or nonstandard form, the formula for the curve is 
that given above, viz., 
\(dP = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(X - \bar{X})^2/2\sigma^2}\, dX\). 
This says that the infinitesimal portion of the total area under 
the curve dP cut off by the infinitesimal class interval X to 
X + dX is approximately equal to the area of an infinitesimal 
rectangle whose height is 
\(\frac{1}{\sigma\sqrt{2\pi}}\, e^{-(X - \bar{X})^2/2\sigma^2}\) and 
whose base is dX. If now the attribute is expressed in 
\((X - \bar{X})/\sigma = x/\sigma\) units, the mathematical formula 
becomes 

\[ dP = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^2}\, d(x/\sigma) \]

In this case, the infinitesimal portion of the total area cut off 
by the infinitesimal class interval \(x/\sigma\) to 
\(x/\sigma + d(x/\sigma)\) is the area of a rectangle whose height 
is \(\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^2}\) and 
whose base is \(d(x/\sigma)\). In other words, the effect of 
measuring the attribute in \(x/\sigma\) units instead of X units is 
to change the size of the unit class interval on which the 
rectangle rests but at the same time to change its height 
proportionately so its area dP remains the same. 

¹ It will be noted that in the standard form the height of the 
middle ordinate is \(1/\sqrt{2\pi} = 0.3989\). 

Fig. 93. Graph of a standard normal curve. 

A table giving the value of 
\(\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^2}\) for various 
values of \(x/\sigma\) will be found in the Appendix, Table VII. 
These are the ordinates of the standard normal curve. It was by 
means of this table that Fig. 93 was plotted; its use in "fitting" 
the normal curve to an actual set of data will be explained in the 
next chapter.¹ 

¹ See pp. 277 and 295-304. 
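
A table of ordinates of the kind described can be generated in a few lines. The Python sketch below is an added illustration and does not reproduce Table VII itself; the function names, the step of 0.5, and the rounding to four places are assumptions made for the example.

```python
import math


def standard_ordinate(t):
    """Ordinate of the standard normal curve at t = x / sigma."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def ordinate_table(start=0.0, stop=3.0, step=0.5):
    """(x / sigma, ordinate) pairs, with ordinates rounded to four places."""
    count = round((stop - start) / step)
    table = []
    for i in range(count + 1):
        t = round(start + i * step, 10)
        table.append((t, round(standard_ordinate(t), 4)))
    return table


if __name__ == "__main__":
    for t, height in ordinate_table():
        print(f"{t:4.1f}  {height:.4f}")
```

Because the curve is symmetrical, the ordinate at a negative value of x/σ is the same as that at the corresponding positive value, so only one half of the table need be computed.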



CHAPTER X 

PROBABILITY CALCULUS 

Two Fundamental Theorems. There are two fundamental 
theorems in the calculus of probabilities. These are the addition 
theorem and the multiplication theorem. There is also a special 
form of the latter that pertains only to "independent" attributes. 

Addition Theorem. The addition theorem pertains to the 
summation of the probabilities of one and the same probability 
set. Since the several attributes of a given probability set are 
necessarily mutually exclusive, it follows that the relative 
frequency of cases having either one of two attributes is the sum 
of the relative frequencies of the separate attributes. For 
example, there are 13 hearts in an ordinary deck of 52 playing 
cards, and also 13 diamonds. The relative frequency of a heart 
is therefore 13/52; the relative frequency of a diamond is also 
13/52. The relative frequency of a red card, i.e., either a heart 
or a diamond, is 26/52, which equals 13/52 + 13/52. Since the 
relative frequencies of the attributes are by definition their 
probabilities, it follows that the probability of a member of a 
set having either one of several attributes is simply the sum of 
the individual probabilities. The probability of a heart or a 
diamond is thus 13/52 + 13/52 = 26/52 = 1/2. This addition theorem 
is valid for infinite probability sets as well as for finite sets. 
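
The card example can be reproduced by counting. The Python sketch below is an added illustration; the deck representation and the function name `probability` are assumptions made for the example.

```python
from fractions import Fraction

SUITS = ("spades", "hearts", "diamonds", "clubs")
RANKS = ("A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K")

# The probability set: an ordinary deck of 52 playing cards.
DECK = [(rank, suit) for suit in SUITS for rank in RANKS]


def probability(attribute):
    """Relative frequency of cards in the deck for which attribute(card) is true."""
    favorable = sum(1 for card in DECK if attribute(card))
    return Fraction(favorable, len(DECK))


p_heart = probability(lambda card: card[1] == "hearts")
p_diamond = probability(lambda card: card[1] == "diamonds")
p_red = probability(lambda card: card[1] in ("hearts", "diamonds"))

print(p_heart, p_diamond, p_red)      # 1/4 1/4 1/2
print(p_heart + p_diamond == p_red)   # True
```

The sum agrees with the direct count only because no card is both a heart and a diamond; the two attributes are mutually exclusive members of one and the same set.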

Algebraically the addition theorem may be expressed as 
follows: If the attributes of a given set are 
\(X_1, X_2, \ldots, X_n\) (representing either qualitative or 
quantitative characteristics) and their probabilities are 
\(p_1, p_2, \ldots, p_n\), then the probability of \(X_1\), 
\(X_2\), or \(X_3\), say, that is, the attribute of being any one 
of these X's, is simply \(p_1 + p_2 + p_3\). If the variation in 
attributes within the set is continuous and if the distribution of 
probability is described by a formula \(dP = \varphi(X)\,dX\), then 
the probability of an attribute within any one of a number of 
small ranges dX whose sum constitutes the range \(X_1\) to \(X_2\) 
is given by \(\sum dP = \sum \varphi(X)\,dX\), or, in the symbolism 
of the integral calculus, 
\(\int dP = \int_{X_1}^{X_2} \varphi(X)\,dX\). The significance of 
this theorem will become clearer when its application to 
particular problems is considered. 

The addition theorem is sometimes stated as follows: The 
probability of either one of two mutually exclusive events 
(attributes) is the sum of their individual probabilities. This 
version of the theorem is perfectly valid if it is understood that 
the two attributes belong to the same set. It will be recalled 
that all the attributes of a set are mutually exclusive.¹ It is 
not true, however, that all mutually exclusive attributes belong 
to the same set. To illustrate this point von Mises gives the 
following example:² 

Suppose that the probability of a man dying between his 
fortieth and forty-first birthdays is 0.011 and the probability of 
his marrying between his forty-first and forty-second birthdays 
is 0.009. These events are mutually exclusive, but it cannot 
be said that the probability of a man either dying in his fortieth 
year or marrying in his forty-first year is 0.011 + 0.009 = 0.02. 
The two events do not belong to the same set, and the addition 
theorem can be validly applied only to attributes of one and the 
same set. 

¹ See p. 252. 

² Probability, Statistics, and Truth, p. 54. 

Multiplication Theorem. The multiplication theorem pertains 
to the calculation of a probability of a "derived" or 
"second-order" probability set from the probabilities of two or 
more "first-order" sets. Consider, for example, an ordinary deck 
of playing cards and a pinochle deck. Each may be said to 
constitute a "first-order" probability set. In the first set the 
probability of an ace is 4/52 = 1/13, and in the second the 
probability of an ace is 8/48 = 1/6. Let the third set be formed 
from these two first-order sets by combining each card of one 
first-order set with each card of the other first-order set. 
Furthermore, let the attribute of any pair be the values of the 
cards making up the pair, such as a king and a nine, an ace and a 
queen, a two and a ten, and the like. The probability of any one 
attribute in this second-order set, say the probability of a pair 
consisting of two aces, is the relative frequency of such pairs 
among all possible pairs that might be formed. It is the purpose 
of the multiplication theorem to give a general rule by which a 
probability of a second-order set of this kind can be computed 
from the probabilities of the first-order sets. 

In deriving the multiplication theorem, two cases must be 
distinguished, one pertaining to independent probabilities and 
the other to dependent probabilities. Consider first the case of 
independent probabilities. Suppose that in finding all the vari- 
ous pairs of cards that might be composed of one card from each 
deck there is no limitation on how the cards might be matched. 
Then each card of the ordinary deck will have associated with it 
every card of the pinochle deck. Hence the set of pinochle cards 
that are paired with the ace of spades, say, from the ordinary 
deck will be the same as the set of pinochle cards paired with the 
jack of diamonds, say, or in fact with any other card from the 
ordinary deck. 

Since the set of pinochle cards paired with each card of the 
ordinary deck is thus the same as the set of pinochle cards paired 
with every other card from the ordinary deck and since this 
common set is identical with the original pinochle set, it follows 
that the probability of any given pinochle card in the set paired
with any given card of the ordinary deck is the same as the 
probability of that pinochle card in the original pinochle deck. 
This being the case, the probability of a card from the pinochle 
deck is said to be independent of the attribute of the card from 
the ordinary deck. 

In general, the probabilities of set II are said to be independent
of the attributes of probability set I if, when the members of the
two sets are paired together, the probability of an attribute in
the subset of members of set II paired with any given member of
set I is the same as the probability of that attribute in the
original set II. Symbolically, $P(B)$ is independent of A if
$P(B/A)$, that is, the probability of B given A, equals $P(B)$,
that is, the probability of B.

Consider once again the given example. Since each card of the
ordinary deck is associated with each card of the pinochle deck,
the total number of different¹ pairs of cards that can be formed is
52 × 48, for there are 52 ways in which the first card can be
picked and 48 ways in which the second card can be picked.
Likewise, the number of combinations that would consist of two
aces would be 4 × 8 = 32. Hence the probability of a pair of
aces in the whole set of pairs of cards is 32/2,496 = 1/78. But
from the calculations this is seen to be equal to
4/52 × 8/48 = 1/78. In other words, the probability of a pair of
aces in the second-order probability set is equal to the
probability of an ace in the ordinary deck times the probability
of an ace in the pinochle deck.

¹ Different in the sense that the cards going to make up any pair are not
precisely the same cards as those making up any other pair. This means
that the two aces of spades, the two kings of spades, the two jacks of
diamonds, etc., in the pinochle deck must be considered as different cards,
even though their value and suit are the same.

The multiplication theorem for independent probabilities may
thus be stated as follows: If $p_a$ is the probability of a member
of set I having the attribute A and if $q_b$ is the probability of a
member of set II having the attribute B, and if $q_b$ is independent
of the attribute assumed by a member of set I, then the
probability of a member of the derived set I-II having the attribute
AB, that is, the probability of a pair AB among all possible pairs
of two elements, one from each of the two sets I and II, is the
product of the probabilities $p_a$ and $q_b$. In simpler form, if
$P(B)$ is independent of A, the joint probability of A and B is equal
to the probability of A times the probability of B, that is,

$$P(AB) = P(A) \cdot P(B)$$
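The counts used in the card example can be checked by brute-force enumeration. The following Python sketch is illustrative only: the deck encodings and variable names are assumptions introduced here, not part of the text. It forms every pair of one ordinary card with one pinochle card and compares the relative frequency of two-ace pairs with the product of the two first-order probabilities.

```python
from fractions import Fraction
from itertools import product

SUITS = ["spades", "hearts", "diamonds", "clubs"]
ORDINARY_VALUES = ["A", "K", "Q", "J", "10", "9", "8", "7", "6", "5", "4", "3", "2"]
PINOCHLE_VALUES = ["A", "K", "Q", "J", "10", "9"]

# First-order set I: an ordinary deck of 52 cards.
ordinary = [(value, suit) for value in ORDINARY_VALUES for suit in SUITS]

# First-order set II: a pinochle deck of 48 cards, two copies of each card.
# The copy number keeps duplicate cards distinct (hypothetical encoding).
pinochle = [(value, suit, copy)
            for value in PINOCHLE_VALUES for suit in SUITS for copy in (1, 2)]

# Second-order set: every ordinary card paired with every pinochle card.
pairs = list(product(ordinary, pinochle))
ace_pairs = [(a, b) for a, b in pairs if a[0] == "A" and b[0] == "A"]

p_ace_ordinary = Fraction(sum(c[0] == "A" for c in ordinary), len(ordinary))
p_ace_pinochle = Fraction(sum(c[0] == "A" for c in pinochle), len(pinochle))
p_pair_of_aces = Fraction(len(ace_pairs), len(pairs))

print(len(pairs), len(ace_pairs))        # all pairs, two-ace pairs
print(p_pair_of_aces)                    # relative frequency of two-ace pairs
print(p_ace_ordinary * p_ace_pinochle)   # product of first-order probabilities
```

Exact fractions are used so that the equality of the two results is not blurred by rounding.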

Consider next the case of dependent probabilities. Suppose
that, in picking pairs of cards from the two decks, the following
modification is introduced. Suppose that every time the card
picked from the ordinary playing card deck is an ace, a king, a
queen, or a jack, then, before any selection is made from the
pinochle deck, an ace, a king, and a queen from each suit are
discarded from the deck and are replaced by a jack, a ten, and a
nine of each suit. The pinochle deck would then contain the
same number of cards as before, but it would have 4 aces, 4 kings,
and 4 queens instead of 8 of each, and 12 jacks, 12 tens, and 12
nines instead of 8 of each.
After the pinochle deck has been modified in this way, let a card
be selected from it and combined with the ace, king, queen, or
jack picked from the ordinary deck. If the card picked from the
ordinary deck is not an ace, a king, a queen, or a jack, let no
modification be made in the pinochle deck. The effect of this
modification in the method of forming pairs of cards is to make
the attributes of the second set dependent on those of the first.
For the set of cards of the pinochle deck associated with an ace,
a king, a queen, or a jack of the ordinary deck is different from
the set of cards of the pinochle deck associated with other cards
of the ordinary deck. The probability of a given type of card
from the pinochle pack now depends on the card from the
ordinary deck; no longer does $P(B/A) = P(B)$.

The number of different pairs of cards that can be made in the
way just described is 52 × 48 as before, for the first card can still
be selected in 52 ways and the second in 48 ways. The number
of pairs of cards consisting of two aces, however, is now 4 × 4
instead of 4 × 8 as in the previous example. Hence the
probability of a pair of aces among the whole set of pairs is

$$\frac{4 \times 4}{52 \times 48} = \frac{1}{156}$$

It will be noted, however, that this probability of a pair of aces
is the product of the probability of an ace in the ordinary deck
(that is, 4/52) times the probability of an ace in the modified
pinochle deck (that is, 4/48). In other words, the probability of a
pair of aces is the probability of an ace in the ordinary deck
times the probability of an ace in the pinochle deck given the
selection of an ace from the ordinary deck.

The multiplication rule for dependent attributes may thus be 
stated as follows: If members of probability set II are paired 
with members of probability set I in such a way that the prob- 
ability of an attribute in the subset of members of set II paired 
with any given member of set I varies with the attribute 
of that member of set I, then the probability of the pair AB in 
the whole set of paired values is equal to the product of the 
probability of A in set I times the probability of B in that subset 
of set II associated with the given attribute A. Symbolically, 
the multiplication rule for dependent attributes is

$$P(AB) = P(A) \cdot P(B/A)$$

that is, the probability of AB equals the probability of A times
the probability of B given A. Since, when $P(B)$ is independent
of A, $P(B/A) = P(B)$, the multiplication rule for independent
probabilities is a special case of the general formula

$$P(AB) = P(A) \cdot P(B/A)$$
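The dependent case can be checked in the same way by counting pairs. The sketch below is a minimal illustration under the stated modification of the pinochle deck; the function `pinochle_counts` and all variable names are hypothetical helpers introduced here. It counts the pairs that arise for each value of the first card and confirms that the relative frequency of two-ace pairs equals $P(A) \cdot P(B/A)$ but not $P(A) \cdot P(B)$.

```python
from fractions import Fraction

ORDINARY_VALUES = ["A", "K", "Q", "J", "10", "9", "8", "7", "6", "5", "4", "3", "2"]
PINOCHLE_VALUES = ["A", "K", "Q", "J", "10", "9"]
N_SUITS = 4


def pinochle_counts(first_value):
    """Cards of each value in the pinochle deck once the first card is known."""
    counts = {value: 8 for value in PINOCHLE_VALUES}   # 2 copies x 4 suits
    if first_value in ("A", "K", "Q", "J"):
        for value in ("A", "K", "Q"):
            counts[value] -= N_SUITS   # one card of each suit discarded
        for value in ("J", "10", "9"):
            counts[value] += N_SUITS   # one card of each suit added
    return counts


total_pairs = 0
ace_ace_pairs = 0
pairs_with_pinochle_ace = 0
for first_value in ORDINARY_VALUES:
    counts = pinochle_counts(first_value)
    total_pairs += N_SUITS * sum(counts.values())
    pairs_with_pinochle_ace += N_SUITS * counts["A"]
    if first_value == "A":
        ace_ace_pairs += N_SUITS * counts["A"]

p_a = Fraction(N_SUITS, N_SUITS * len(ORDINARY_VALUES))   # P(A): ace in ordinary deck
p_b_given_a = Fraction(pinochle_counts("A")["A"], 48)     # P(B/A): ace in modified deck
p_b = Fraction(pairs_with_pinochle_ace, total_pairs)      # P(B) over all pairs
p_ab = Fraction(ace_ace_pairs, total_pairs)               # P(AB): pair of aces

print(total_pairs, ace_ace_pairs, p_ab)
print(p_ab == p_a * p_b_given_a)   # general multiplication rule
print(p_ab == p_a * p_b)           # rule for independence, which no longer applies
```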

The multiplication theorem for both dependent and independent
probabilities is valid for infinite sets as well as finite sets.

The significance of independence and dependence may be
illustrated by a case pertaining to real life. Suppose that one
probability set consists of American fathers of the white race
and the other probability set consists of their eldest sons. Sup- 
pose the attributes distinguished in each set are dying from 
cancer, dying from heart disease, dying from tuberculosis, and 
dying from other causes. If, now, the probability of a son dying 
from tuberculosis, say, is greater, among those sons whose fathers 
died of tuberculosis, than among the whole group of sons, then 
the probability of a son dying from a certain cause is not inde- 
pendent of the cause of death of his father. If, on the other 
hand, the probability of a son dying from any particular cause 
is the same for the sons whose fathers died from cancer, the 
sons whose fathers died from heart disease, the sons whose fathers 
died from tuberculosis, and the sons whose fathers died from 
other causes, i.e., if the probability of a son dying from any
particular cause is the same, whatever the cause of the father's
death, then the probability of a son's death is independent of the
cause of death of his father. For example, a case of dependence 
would be the following: 


                                     Probability of death of eldest son from
                                     Cancer     Heart      Tuberculosis  Other
                                                disease                  causes
Eldest sons whose fathers died of
  Cancer                             0.310      0.102      0.030         0.558
  Heart disease                      [illegible] 0.151     0.041         0.590
  Tuberculosis                       [illegible] 0.118     0.093         0.569
  Other causes                       0.215      [illegible] 0.042        0.631
All eldest sons                      0.228      0.120      0.046         0.606



A case of independence would be that in which the figures 
in every row of every column were the same and these in turn 
were the same as the figures for "All sons." If the probability
of a father dying from cancer was 0.228, from heart disease 
0.120, from tuberculosis 0.046, and from other causes 0.606, 
then, in the case of dependence represented by the above table, 
the probability of both a father and an eldest son dying from 
heart disease would be (0.120)(0.151) = 0.018. In the case of
independence, on the other hand, the probability of both a
father and an eldest son dying of heart disease would be

$$(0.120)(0.120) = 0.014$$
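The two products can be reproduced in a few lines. This is only a sketch of the arithmetic: the variable names are assumptions, and the overall probability of 0.120 for a son dying of heart disease is taken from the independence calculation in the text.

```python
# Probability of each cause of death for fathers, as given in the text.
p_father = {
    "cancer": 0.228,
    "heart disease": 0.120,
    "tuberculosis": 0.046,
    "other causes": 0.606,
}

# P(son dies of heart disease / father died of heart disease), from the table.
p_son_heart_given_father_heart = 0.151
# P(son dies of heart disease) among all eldest sons (assumed from the text).
p_son_heart_overall = 0.120

# Dependence: P(AB) = P(A) * P(B/A).
p_dependent = p_father["heart disease"] * p_son_heart_given_father_heart
# Independence: P(AB) = P(A) * P(B).
p_independent = p_father["heart disease"] * p_son_heart_overall

print(round(p_dependent, 3), round(p_independent, 3))
```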

Illustrations. The following examples will help to illustrate 
the use of the addition and multiplication theorems in the cal- 
culation of desired probabilities. Some will also serve to illus- 
trate the use of the normal probability curve. 

Examples Involving Discrete Distributions. Suppose that a
gambling game consists of the random tossing of five coins. 
You agree to pay your opponent a predetermined sum of money 
whenever all five coins turn up heads; he agrees to pay you a 
predetermined sum whenever any other result occurs. The ques- 
tion is: What should the odds be to make the game a fair one? 
The answer is obtained as follows: 

Assume that the character of the coins and the method of 
tossing are such as to cause each coin to tend to turn up heads 
and tails in equal proportion. In a large number of tossings, 
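therefore, the probability of a head on each coin may be taken
as 1/2. Assume, also, that the method of tossing is such as to
make the tosses of each coin independent of the others. Then
the probability of heads on all five coins is

$$\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right) = \left(\tfrac{1}{2}\right)^{5} = \tfrac{1}{32}$$

Accordingly, the fair odds are 31 to 1. That is, the game will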
be fair if you agree to pay your opponent $31 every time five 
heads occur and he agrees to pay you $1 every time some other 
result occurs. Of course, in an actual game the assumption 
regarding the character of the coins and the method of tossing 
would have to be checked by examination of the coins and by 
trial tossings. This is an illustration of the multiplication 
theorem for independent probabilities. 
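The five-coin result can be confirmed by listing all outcomes. The sketch below assumes, as the text does, that every coin is fair and independent; the variable names are illustrative. It also checks that stakes of 31 to 1 leave neither player with an expected gain.

```python
from fractions import Fraction
from itertools import product

N_COINS = 5

# All 2**5 equally likely results of tossing five coins.
outcomes = list(product("HT", repeat=N_COINS))
all_heads = [o for o in outcomes if set(o) == {"H"}]

p_enumerated = Fraction(len(all_heads), len(outcomes))
p_multiplied = Fraction(1, 2) ** N_COINS        # multiplication theorem

# Fair odds: probability of any other result relative to that of five heads.
odds_against = (1 - p_multiplied) / p_multiplied

# Your expected gain per toss when you pay 31 on five heads and receive 1 otherwise.
expected_gain = (1 - p_multiplied) * 1 - p_multiplied * 31

print(p_enumerated, p_multiplied, odds_against, expected_gain)
```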

Another gambling game consists of the throwing of two dice.
You agree to pay your opponent a predetermined sum whenever
a combination totaling 7 appears, and he agrees to pay
you a predetermined sum whenever another result appears.
Again the problem is to determine fair odds. This may be done
by a combined application of the multiplication and addition
theorems.

Assume again that the dice are of such a character and are so
thrown that all faces tend to turn up in equal proportions. The
probability of any given result for each die is therefore 1/6. The
six possible combinations that add up to 7 are (1,6), (2,5), (3,4),
(4,3), (5,2), and (6,1). If it is assumed that the dice are thrown
so as to give independent results, then, by the multiplication
theorem for independent probabilities, the probability of each
one of the above combinations is (1/6)(1/6) = 1/36. Any one of the
combinations, however, will yield a total of 7. The probability
of a total of 7 is therefore the probability of any one of these
combinations, which, by the addition theorem, is

$$\tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} = \tfrac{6}{36} = \tfrac{1}{6}$$

Hence, fair odds are 5 to 1; that is, the game will be fair if you
pay your opponent $5 every time a 7 occurs and he pays you $1 
every time some other total occurs. Again, in a real game the 
character of the dice and the method of throwing should be 
checked to see if the above assumptions are warranted. 
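A short enumeration illustrates the combined use of the two theorems for the dice game. It is a sketch under the same assumptions of fair, independent dice; the names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered results of throwing two dice.
rolls = list(product(range(1, 7), repeat=2))
sevens = [roll for roll in rolls if sum(roll) == 7]

# Multiplication theorem: probability of any one ordered combination.
p_each = Fraction(1, 6) * Fraction(1, 6)
# Addition theorem: the combinations totaling 7 are mutually exclusive.
p_seven = sum((p_each for _ in sevens), Fraction(0))

odds_against = (1 - p_seven) / p_seven

print(sevens)
print(p_seven, odds_against)
```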

Consider still a third gambling game. Suppose that two cards 
are drawn at random from a well-shuffled pack, the suit is noted,
the cards are returned to the pack, the latter is shuffled, and 
the whole operation is repeated. Each time the cards are all 
spades you agree to pay your opponent a predetermined sum; 
if they are otherwise, he pays you. What are fair odds? 

Assume as in the other games that the method of drawing 
cards and the method of shuffling are such that all cards tend 
to be drawn in equal proportion. The probability of a spade 
among the first cards drawn will thus be 13/52, assuming the usual
deck of 13 spades, 13 hearts, 13 diamonds, and 13 clubs. If
the first card drawn is a spade, the remainder of the deck
contains 12 spades and 13 of each of the other suits. If in a large
set of drawings each of these remaining cards tends to turn up
in the same proportion as every other card, then the probability
of a spade among the second cards drawn will be 12/51. This is
the probability of a spade on the second draw, assuming a spade
on the first draw. Then, according to the multiplication theorem
for dependent probabilities, the probability of a spade on both
the first and second draws is (13/52)(12/51) = 3/51. The odds will be
fair, therefore, if you pay $48 every time two spades are drawn
and your opponent pays $3 every time any other combination
is drawn. Again the assumptions would have to be checked in a
real game.
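The two-spade probability can likewise be verified by counting ordered draws without replacement. This is an illustrative sketch; the deck encoding and names are assumptions introduced here.

```python
from fractions import Fraction
from itertools import permutations

# A 52-card deck: 13 ranks in each of spades, hearts, diamonds, clubs.
deck = [(rank, suit) for suit in "SHDC" for rank in range(1, 14)]

# Every ordered draw of two different cards (the first is not replaced).
draws = list(permutations(deck, 2))
both_spades = [d for d in draws if d[0][1] == "S" and d[1][1] == "S"]

p_enumerated = Fraction(len(both_spades), len(draws))

# Multiplication theorem for dependent probabilities: P(A) * P(B/A).
p_first = Fraction(13, 52)
p_second_given_first = Fraction(12, 51)
p_multiplied = p_first * p_second_given_first

# Your expected gain per deal when you pay 48 on two spades and receive 3 otherwise.
expected_gain = (1 - p_multiplied) * 3 - p_multiplied * 48

print(p_enumerated, p_multiplied, expected_gain)
```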

Examples Involving Continuous Distributions. All the examples
so far have been concerned with discrete probability
distributions. Much of the practical work in statistics, however, is
concerned with continuous distributions. As the first example of
this kind, consider the distribution of heights of eighteen-year-old
boys. The fitting of a normal curve to the heights of 300
eighteen-year-old Princeton freshmen¹ suggests that in general
the forces of nature are such as to cause a "normal" distribution
of heights. If this is assumed to be the case, then the normal
curve can be employed to calculate the probability of an
eighteen-year-old boy having a height lying within any given range.
This is done as follows:

As indicated above,² if the distribution of probability follows
the normal law, then the probability of an attribute ranging from
$x/\sigma$ to $x/\sigma + d(x/\sigma)$ is given by the formula

$$\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^{2}}\, d(x/\sigma)$$

where $x/\sigma$ represents a deviation of the attribute from the mean
attribute measured in $\sigma$ units, $\sigma$ is the standard deviation of the
distribution, and $d(x/\sigma)$ is an infinitesimally small range. This
represents approximately the area under the curve for the
infinitesimal range $x/\sigma$ to $x/\sigma + d(x/\sigma)$. A finite range running, say,
from $x_1/\sigma$ to $x_2/\sigma$, can be conceived as made up of a number of
infinitesimal ranges of size $d(x/\sigma)$; and the probability of an attribute
ranging from $x_1/\sigma$ to $x_2/\sigma$ is (by the addition theorem) merely the
sum of the probabilities for each of these infinitesimal ranges, viz.,

$$\sum_{x_1/\sigma}^{x_2/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^{2}}\, d(x/\sigma)$$

or, in the notation of the integral calculus,

$$\int_{x_1/\sigma}^{x_2/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^{2}}\, d(x/\sigma)$$

¹ See pp. 295-306 and especially Fig. 101.

² See pp. 264-267.

In other words, the probability of an attribute ranging from
$x_1/\sigma$ to $x_2/\sigma$ is simply the area under the curve for this range.
This is graphically shown in Fig. 94 and is a direct result of the
addition theorem.

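The area under the normal curve for any given range might, as
indicated, be found by evaluating the "integral"

$$\int_{x_1/\sigma}^{x_2/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x/\sigma)^{2}}\, d(x/\sigma)$$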

This is not an easy task, however, even for those who understand
advanced mathematics. Consequently, tables have been prepared
that give the approximate areas under the normal curve
for certain specified ranges and that permit by simple arithmetical
operations the calculation of areas for all other ranges. Such a
table is Table VI of the Appendix, page 693. This gives the
proportionate area under the positive half of the normal curve
from the mean ($x/\sigma = 0$) to various selected points. Thus from
the table it is seen that the proportion of the area lying under the
normal curve from $x/\sigma = 0$ to $x/\sigma = 0.2$ is 0.07926.

Fig. 94. Illustration of computation of probability of an $x/\sigma$ lying
between $x_1/\sigma$ and $x_2/\sigma$.

In addition, since the proportion of the area under the normal
curve from $x/\sigma = 0$ (the mean) to infinity is 0.50000, the
proportion of area under the curve from any selected point to infinity
can be readily calculated. Thus the proportion of the area under
the curve from $x/\sigma = 0.2$ to infinity is 0.42074 (i.e., 0.50000 -
0.07926), the proportion of area from $x/\sigma = 1.96$ to infinity is
0.02500 (i.e., 0.50000 - 0.47500), etc. Owing to the symmetry
of the curve, the same values hold true for areas from $x/\sigma = 0$ to
$x/\sigma = -\infty$. Thus the proportion of the area for the range from
$x/\sigma = -0.2$ to $x/\sigma = -\infty$ is 0.50000 - 0.07926 = 0.42074.

To find proportionate areas for other ranges, it is necessary
merely to add or subtract proportionate areas given directly by
the table. Thus, the proportion of area for the range $x/\sigma = 0.2$
to $x/\sigma = 0.3$ is the difference between the proportionate area
from $x/\sigma = 0.3$ to the mean, and the proportionate area from
$x/\sigma = 0.2$ to the mean, i.e., 0.11791 - 0.07926 = 0.03865.
Likewise, the proportionate area under the curve for the range
$x/\sigma = -0.2$ to $x/\sigma = +0.3$ is simply the sum of the
proportionate area from $x/\sigma = -0.2$ to $x/\sigma = 0$, and the proportionate
area from $x/\sigma = 0$ to $x/\sigma = 0.3$, i.e., 0.07926 + 0.11791 = 0.19717.
Proportionate areas for points not given directly or
indirectly by the table may be obtained by interpolation; usually,
straight-line interpolation (i.e., the calculation of simple
proportionate differences) gives satisfactory results.
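Where a printed table is not at hand, the same proportionate areas can be computed numerically. The sketch below is not the book's Table VI: it swaps in the error function from Python's standard library, using the identity that the area from the mean to $z = x/\sigma$ equals $\tfrac{1}{2}\,\mathrm{erf}(z/\sqrt{2})$. The function name `area_from_mean` is a hypothetical helper.

```python
from math import erf, sqrt


def area_from_mean(z):
    """Proportionate area under the normal curve from the mean (0) to z.

    The curve is symmetrical, so the sign of z does not matter.
    """
    return 0.5 * erf(abs(z) / sqrt(2))


# Area from the mean to x/sigma = 0.2.
print(f"{area_from_mean(0.2):.5f}")
# Area from 0.2 to infinity, and from 1.96 to infinity.
print(f"{0.5 - area_from_mean(0.2):.5f}", f"{0.5 - area_from_mean(1.96):.5f}")
# Area from 0.2 to 0.3 (difference of two tabled areas).
print(f"{area_from_mean(0.3) - area_from_mean(0.2):.5f}")
# Area from -0.2 to +0.3 (sum of two tabled areas).
print(f"{area_from_mean(-0.2) + area_from_mean(0.3):.5f}")
```

Because the function evaluates any value of $z$ directly, no interpolation between tabled points is needed.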

To make use of Table VI in a given problem it is merely
necessary to convert the original measurements into deviations from
the mean expressed in $\sigma$ units, i.e., to convert original units into
$\sigma$ units. The mean height of eighteen-year-old boys, for example
(as estimated from the heights of eighteen-year-old Princeton
freshmen of the class of 1943), is 70.47 inches, and the standard
deviation of heights is 2.49 inches. Hence the probability of an
eighteen-year-old boy being 72 to 73 inches tall is given by the area
under the normal curve from

$$\frac{x}{\sigma} = \frac{72 - 70.47}{2.49} = 0.61$$

to

$$\frac{x}{\sigma} = \frac{73 - 70.47}{2.49} = 1.02$$

This, in accordance with the method outlined in the previous
paragraph for calculating such an area, is 0.11707. Similarly, the
probability of a boy taller than 74 inches is given by the area
under the normal curve from $x/\sigma = (74 - 70.47)/2.49 = 1.42$ to infinity,
which the table shows to be 0.50000 - 0.42220 = 0.07780.
Again, the probability that two boys picked at random should be
taller than 74 inches is the product of the two individual
probabilities (the multiplication theorem for independent
probabilities) or

$$(0.07780)(0.07780) = 0.00605$$

Table VI thus readily facilitates the calculation of probabilities
whenever the primary distribution or distributions follow the 
normal law. 
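The height calculations can be sketched as follows. As before, the error function stands in for Table VI, and the deviations are rounded to two decimals on the assumption that this mirrors how the table would be entered; the helper names are illustrative.

```python
from math import erf, sqrt

MEAN_HEIGHT = 70.47   # inches, from the text
SD_HEIGHT = 2.49      # inches, from the text


def area_from_mean(z):
    """Proportionate area under the normal curve from the mean (0) to z."""
    return 0.5 * erf(abs(z) / sqrt(2))


def sigma_units(height):
    """Deviation from the mean in sigma units, rounded to two decimals."""
    return round((height - MEAN_HEIGHT) / SD_HEIGHT, 2)


z72, z73, z74 = sigma_units(72), sigma_units(73), sigma_units(74)

# Probability of a height between 72 and 73 inches.
p_72_to_73 = area_from_mean(z73) - area_from_mean(z72)
# Probability of a height above 74 inches.
p_over_74 = 0.5 - area_from_mean(z74)
# Probability that two boys picked at random are both taller than 74 inches.
p_two_over_74 = p_over_74 * p_over_74

print(z72, z73, z74)
print(f"{p_72_to_73:.5f} {p_over_74:.5f} {p_two_over_74:.5f}")
```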
SYMMETRICAL BINOMIAL DISTRIBUTION AND THE
NORMAL CURVE

INTRODUCTION 

The preceding chapters have been concerned with probability 
and the probability calculus. These were discussed for the 
purpose of providing tools for subsequent analysis. In this 
chapter the tools will be employed in developing a theoretical 
explanation of the normal frequency curve. The line of attack 
will be as follows. 

The argument will begin with an abstract study of a simple 
problem in combinatorial analysis. The basic data will be 10 
coins, each of which has two sides. These sides will be marked 
with a head or a tail, and each coin will have one head and one 
tail. 

The combinatorial problem will be the determination of the
relative frequencies or probabilities of various types of combi- 
nations in the whole set of combinations that might be made 
from various arrangements of the given set of coins. Thus 
the theoretical problem will be the determination of the relative 
frequencies or probabilities of combinations having 0, 1, 2, . . . ,
10 heads in the whole set of combinations that might be
constructed from various arrangements of the 10 coins.

In the terminology of probability, this combinatorial problem
consists of the derivation of a certain second-order probability 
set from the elementary probability set. To put this in another 
way, the problem is to find the type of frequency or probability 
distribution that results from the combination of certain elemen- 
tary frequency or probability distributions. Attention will in 
particular center upon the form of the derived frequency or 
probability distribution. Exact and approximate formulas will 
be determined for this distribution. 

The purely theoretical part of the theory of the normal curve
will thus be a set of problems in the probability calculus. What
is ultimately desired, however, is the explanation that this
distribution affords of some of the frequency distributions that
appear in real life, such as the frequency distributions of the
heights of adult white males, the frequency distribution of 
samples from a given population, and the like. This explana- 
tion will be undertaken after the completion of the combinatorial 
analysis. 


SYMMETRICAL BINOMIAL DISTRIBUTION 

Derivation. As already suggested, the discussion of the theory 
of the normal frequency curve will begin with the analysis of a
simple problem involving 10 coins. Each coin, it will be assumed, 
has two sides, one of which is a head, the other a tail. Since the 
probability of an object has been defined as its relative frequency 
in the set of objects to which it belongs, it may be said that for 
each coin the probability of a side being a head is 1/2 and the
probability of its being a tail is also 1/2. The problem to be
considered is this: If the 10 coins are combined in all possible 
ways, the selection of a head or a tail for any one coin being 
independent of the selection for other coins, what are the various 
types of combinations of heads and tails that will be produced 
and what will be the probability of each type in the set of all 
possible combinations? This is a straightforward problem in the 
theory of combinations and may be solved as follows. 

To facilitate the analysis, let the 10 coins be distinguished 
by the letters A, B, C, D, E, F, G, H, I, and J. A combination 
having 0 heads, for example, will be represented as follows,

A B C D E F G H I J
T T T T T T T T T T

a combination having 4 heads as follows,

A B C D E F G H I J
H T T H H T T T H T

etc. 

Consider first the combination having 0 heads. Since the
probability of a tail on each coin is 1/2, the probability of 0 heads
is $(1/2)^{10}$. For the probability of A being a tail is 1/2, and the same
is true for B, C, D, E, F, G, H, I, and J. Furthermore, since the
probability of a tail for any one coin is independent of what
the other coins are, the probability of the above result is, by
the multiplication theorem, the product of the 10 independent
probabilities, or
$(1/2)(1/2)(1/2)(1/2)(1/2)(1/2)(1/2)(1/2)(1/2)(1/2) = (1/2)^{10}$. Finally,
it is to be noted that this result can be obtained in only one way.
Hence it is to be concluded that the probability of 0 heads is
$1/2^{10} = 1/1{,}024$.

Consider next the following combination:

A B C D E F G H I J
H T T T T T T T T T

This is a case of 1 head. Since the probability of A being
a head is 1/2 and the probability of each of the other coins being a
tail is also 1/2 and since each of these results is independent of the
others, it follows that the probability of this particular
combination of heads and tails is again $(1/2)^{10}$. But there are also
other combinations having only 1 head. Such are

A B C D E F G H I J
T H T T T T T T T T
T T H T T T T T T T
T T T T T T T T T H

In fact, it is readily seen that there are 10 combinations
altogether in each of which a different coin is the one being a
head. The probability, therefore, of any one of these 10
combinations, i.e., the probability of a head on some one and only
one of the 10 coins is, by the addition theorem,
$10(1/2)^{10} = 10/1{,}024$.

Consider now the combination

A B C D E F G H I J
H H T T T T T T T T

This is a case of 2 heads. Since the probability of A being a
head is 1/2, the probability of B being a head is 1/2 and the
probability of each of the other coins being a tail is likewise 1/2; and
since each of these results is independent of all the others, it
follows once more that the probability of this particular
combination is $(1/2)^{10}$.

But, as previously, this is not the only combination having
2 heads. The reader himself will be able to write down a number
of other combinations in which only 2 heads appear. The
question is how many different combinations of the 10 coins
have 2 and only 2 heads? This is answered by the theory of
permutations and combinations outlined in Chap. X.¹ Thus
the number of different combinations of 10 coins taken 2 at a
time is

$$C_{2}^{10} = \frac{10!}{2!\,8!} = 45$$

[Cf. Eq. (3), page 234.] There being, therefore, 45 different
combinations, each of which has a probability of $(1/2)^{10}$, it follows
that the probability of any one of them is

$$45\left(\frac{1}{2}\right)^{10} = \frac{45}{1{,}024}$$

The probability of other possible combinations is determined
in a similar manner. In general, the probability of $N_1$ heads is

$$\frac{10!}{N_1!\,(10 - N_1)!}\left(\frac{1}{2}\right)^{10}$$

Thus the probability of 3 heads and 7 tails is

$$\frac{10!}{3!\,7!}\left(\frac{1}{2}\right)^{10} = \frac{120}{1{,}024}$$

The probability of 6 heads and 4 tails is

$$\frac{10!}{4!\,6!}\left(\frac{1}{2}\right)^{10} = \frac{210}{1{,}024}$$

etc. The results obtained by use of this formula may be
tabulated as follows:


Table 19. Probabilities of Various Combinations among All Possible
Combinations of 10 Coins

Combinations Having      Probability
 0 heads                   1/1,024 = 0.00098
 1 head                   10/1,024 = 0.00977
 2 heads                  45/1,024 = 0.04394
 3 heads                 120/1,024 = 0.11719
 4 heads                 210/1,024 = 0.20508
 5 heads                 252/1,024 = 0.24609
 6 heads                 210/1,024 = 0.20508
 7 heads                 120/1,024 = 0.11719
 8 heads                  45/1,024 = 0.04394
 9 heads                  10/1,024 = 0.00977
10 heads                   1/1,024 = 0.00098


¹ See pp. 232-234.

It will be noted that the series of probabilities of 0, 1, 2, . . . ,
10 heads may be obtained by the expansion of $(1/2 + 1/2)^{10}$. This
distribution of probability is consequently called a "binomial"
distribution.¹ If N coins had been used instead of 10, the
probabilities of the distribution would have been given by the
terms of the expansion of $(1/2 + 1/2)^{N}$. Thus the probability of
a combination having $N_1$ heads among all possible combinations
of N coins is²

$$\frac{N!}{N_1!\,(N - N_1)!}\left(\frac{1}{2}\right)^{N}$$

or if $N_2$ is set equal to $N - N_1$,

$$\frac{N!}{N_1!\,N_2!}\left(\frac{1}{2}\right)^{N}$$

This is the general formula for a symmetrical binomial
distribution.
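The general formula lends itself to direct computation. The sketch below uses `math.comb` from Python's standard library (available from Python 3.8) for the number of combinations; the function name `symmetrical_binomial` is a hypothetical helper. For N = 10 it prints the same fractions as Table 19.

```python
from fractions import Fraction
from math import comb


def symmetrical_binomial(n):
    """Probabilities of 0, 1, ..., n heads among all combinations of n coins."""
    return [Fraction(comb(n, heads), 2 ** n) for heads in range(n + 1)]


N = 10
table = symmetrical_binomial(N)
for heads, probability in enumerate(table):
    # comb() is printed directly because Fraction reduces to lowest terms.
    print(f"{heads:2d} heads  {comb(N, heads):3d}/{2 ** N:,} = {float(probability):.5f}")

# The terms of the expansion of (1/2 + 1/2)**N must add up to 1.
print(sum(table))
```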

Character of the Symmetrical Binomial Distribution. A graph
of the probabilities given in Table 19 is presented in Fig.
95. It will be noted from the table and also from the figure that 
the probability of 0 heads is the same as the probability of 10 
heads, that the probability of 1 head is the same as the prob- 
ability of 9 heads, etc. In other words, the distribution of 
probabilities is symmetrical about a central point, in this case 
the point representing 5 heads. This symmetry is the reason 
for the name "symmetrical" binomial distribution.

Mathematical analysis shows that in general the symmetrical
binomial distribution has the following characteristics:³

$$\text{Mean} = \frac{N}{2}, \qquad \sigma = \frac{\sqrt{N}}{2}, \qquad \beta_1 = 0, \qquad \beta_2 = 3 - \frac{2}{N} \tag{2}$$

¹ Cf. p. 234.

² Ibid.

³ These formulas are derived in Smith and Duncan, Sampling Statistics.

It will be sufficient to check these equations here by finding the
mean, standard deviation, $\beta_1$, and $\beta_2$ of the distribution of
Table 19.



Fig. 95. Graph of a symmetrical binomial distribution.

The mean of a distribution of probability, it will be recalled,
is equal to the sum of the attributes times their probabilities.¹
The mean of the distribution of Table 19 is thus

$$
\begin{aligned}
&\tfrac{1}{1{,}024}(0) + \tfrac{10}{1{,}024}(1) + \tfrac{45}{1{,}024}(2) + \tfrac{120}{1{,}024}(3) + \tfrac{210}{1{,}024}(4) + \tfrac{252}{1{,}024}(5) \\
&\quad + \tfrac{210}{1{,}024}(6) + \tfrac{120}{1{,}024}(7) + \tfrac{45}{1{,}024}(8) + \tfrac{10}{1{,}024}(9) + \tfrac{1}{1{,}024}(10) = 5
\end{aligned}
$$

According to the formula, the mean equals $N/2 = 10/2 = 5$, which
is the same value as that derived above by direct calculation.

Similarly, the variance of a distribution of probability is equal
to the sum of the deviations from the mean squared and
multiplied by their probabilities. Hence, the variance of the
distribution of Table 19 is

$$
\begin{aligned}
&\tfrac{1}{1{,}024}(-5)^2 + \tfrac{10}{1{,}024}(-4)^2 + \tfrac{45}{1{,}024}(-3)^2 + \tfrac{120}{1{,}024}(-2)^2 + \tfrac{210}{1{,}024}(-1)^2 + \tfrac{252}{1{,}024}(0)^2 \\
&\quad + \tfrac{210}{1{,}024}(1)^2 + \tfrac{120}{1{,}024}(2)^2 + \tfrac{45}{1{,}024}(3)^2 + \tfrac{10}{1{,}024}(4)^2 + \tfrac{1}{1{,}024}(5)^2 = 2.5
\end{aligned}
$$

This again checks with the formula, which gives $\sigma^2 = N/4 = 2.5$.

¹ See p. 189.

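Likewise, the third moment about the mean of a probability
distribution is the sum of the deviations from the mean cubed
and multiplied by their probabilities, and the fourth moment is
the sum of the deviations from the mean raised to the fourth
power and multiplied by their probabilities. Thus, for the
distribution of Table 19,

$$
\begin{aligned}
\mu_3 &= \tfrac{1}{1{,}024}(-5)^3 + \tfrac{10}{1{,}024}(-4)^3 + \tfrac{45}{1{,}024}(-3)^3 + \tfrac{120}{1{,}024}(-2)^3 + \tfrac{210}{1{,}024}(-1)^3 + \tfrac{252}{1{,}024}(0)^3 \\
&\quad + \tfrac{210}{1{,}024}(1)^3 + \tfrac{120}{1{,}024}(2)^3 + \tfrac{45}{1{,}024}(3)^3 + \tfrac{10}{1{,}024}(4)^3 + \tfrac{1}{1{,}024}(5)^3 = 0
\end{aligned}
$$

and

$$
\begin{aligned}
\mu_4 &= \tfrac{1}{1{,}024}(-5)^4 + \tfrac{10}{1{,}024}(-4)^4 + \tfrac{45}{1{,}024}(-3)^4 + \tfrac{120}{1{,}024}(-2)^4 + \tfrac{210}{1{,}024}(-1)^4 + \tfrac{252}{1{,}024}(0)^4 \\
&\quad + \tfrac{210}{1{,}024}(1)^4 + \tfrac{120}{1{,}024}(2)^4 + \tfrac{45}{1{,}024}(3)^4 + \tfrac{10}{1{,}024}(4)^4 + \tfrac{1}{1{,}024}(5)^4 = 17.5
\end{aligned}
$$

Since, by definition, $\beta_1 = \mu_3^2/\mu_2^3$ and $\beta_2 = \mu_4/\mu_2^2$, it follows that for
this distribution $\beta_1$ is zero and $\beta_2 = 17.5/(2.5)^2 = 2.8$, which are the
values again given by the general formulas. These formulas are
valid for all symmetrical binomial distributions.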
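The moment calculations for Table 19 can be repeated exactly with rational arithmetic. In this sketch the function `central_moment` and the variable names are illustrative, and the comparison value $3 - 2/N$ is the reading of the general formula for $\beta_2$ assumed here.

```python
from fractions import Fraction
from math import comb

N = 10
# Probability of each number of heads, as in Table 19.
probs = {heads: Fraction(comb(N, heads), 2 ** N) for heads in range(N + 1)}

# Mean: sum of the attributes times their probabilities.
mean = sum(heads * p for heads, p in probs.items())


def central_moment(order):
    """Sum of deviations from the mean raised to `order`, times their probabilities."""
    return sum((heads - mean) ** order * p for heads, p in probs.items())


mu2 = central_moment(2)   # variance
mu3 = central_moment(3)
mu4 = central_moment(4)

beta1 = mu3 ** 2 / mu2 ** 3
beta2 = mu4 / mu2 ** 2

print(mean, mu2, mu3, mu4)
print(beta1, beta2)
# Assumed general formulas: N/2, N/4, and 3 - 2/N.
print(Fraction(N, 2), Fraction(N, 4), 3 - Fraction(2, N))
```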

The Normal Curve. If 40 instead of 10 coins were involved,
the distribution of probability would be considerably more
spread out than that of Table 19. This is readily seen from
Fig. 96. In general, the formula $\sigma^2 = N/4$ indicates that the
dispersion of the distribution increases in proportion to $\sqrt{N}$.
If the horizontal scale is reduced, however, and the vertical scale
enlarged, in the same proportion that the dispersion of the
distribution is increased, then the effect of increasing N is to bring
the ordinates of the distribution closer together and to raise them
to the height of the original distribution. Under these
conditions the tops of the ordinates tend to sketch out a smooth curve
as N is increased. This is indicated in Fig. 97. It can be shown
that the limit that the symmetrical binomial distribution
approaches as N is increased, while at the same time the scales
are adjusted in proportion to $\sqrt{N}$, is the normal curve

$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^{2}}{2\sigma^{2}}}$$

Fig. 96. Graphs of two symmetrical binomial distributions, one for N = 10,
the other for N = 40.
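The approach of the binomial ordinates to the normal curve can be observed numerically. The following sketch is an illustration, not a proof: it multiplies both sets of ordinates by $\sigma = \sqrt{N}/2$ to put them on the common vertical scale described above and reports the largest gap between them for several values of N. The function names and the choice of N values are assumptions.

```python
from math import comb, exp, pi, sqrt


def binomial_ordinate(n, heads):
    """Probability of `heads` heads among all combinations of n coins."""
    return comb(n, heads) / 2 ** n


def normal_ordinate(x, sigma):
    """Ordinate of the normal curve at a deviation x from the mean."""
    return exp(-x * x / (2 * sigma * sigma)) / (sigma * sqrt(2 * pi))


def max_scaled_gap(n):
    """Largest gap between binomial and normal ordinates after rescaling by sigma."""
    mean = n / 2
    sigma = sqrt(n) / 2
    return max(
        abs(sigma * binomial_ordinate(n, heads)
            - sigma * normal_ordinate(heads - mean, sigma))
        for heads in range(n + 1)
    )


for n in (10, 40, 160, 640):
    print(n, max_scaled_gap(n))
```

The printed gaps shrink as N grows, which is the behavior the limit statement describes.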

That the symmetrical binomial distribution approaches the
normal curve as a limit can be definitely proved by rigorous
mathematical analysis.¹ Certain general considerations, however,
suggest this same conclusion.

1. The distributions of Figs. 95 to 97 have a shape similar to
that of the normal curve; and if a normal curve with the same
mean as any one of these distributions and the same standard
deviation is graphed together with that distribution, the curve
is seen to be a good “fit.” This is shown² in Fig. 98.

Fig. 97. Illustration of effect of scale adjustments on a symmetrical binomial
distribution.

¹ This is demonstrated in Smith and Duncan, Sampling Statistics, pp. 68-74.

² The binomial distribution is a discrete distribution, and its probabilities
are correctly represented by a series of ordinates as in Figs. 96 and 97. It
is the ordinates of the normal curve of Fig. 98 at 1/σ, 2/σ, etc., and not
sections of the curve area that approximate the binomial ordinates at these
points. As pointed out, however, in Smith and Duncan, Sampling Statistics,
it is possible to represent any symmetrical binomial distribution by a
histogram whose area is approximated by that of a normal curve. In this way
the area tables of the normal curve can be used to approximate a series of
binomial probabilities.

Fig. 98. Comparison of a symmetrical binomial distribution with the normal
curve.

Fig. 99. Graph illustrating computation of relative slope of frequency polygon
at a given point.

2. Equations (2)³ show that $\beta_1 = 0$ for the symmetrical
binomial distribution and that $\beta_2$ approaches the value 3 as N
is increased. These are also the values of $\beta_1$ and $\beta_2$ for the
normal curve.

3. If a graph is made of the symmetrical binomial distribution
in the form of a frequency polygon, the relative slope of any side
of this polygon at its mid-point is the same as the relative slope
of a normal curve at that point. Figure 99 shows, for example,
that for N = 10 the ordinate of the symmetrical binomial at
$N_1 = 6$ is equal to 210/1,024 and the ordinate at $N_1 = 7$ is
120/1,024. The mid-point between 6 and 7 is 6.5, and the
ordinate of the polygon at that point is

$$\frac{1}{2}\left(\frac{210}{1{,}024} + \frac{120}{1{,}024}\right) = \frac{165}{1{,}024}$$

The absolute slope of the polygon at this mid-point is given by
the ratio of the difference between the ordinates at 7 and 6 (that
is, $\frac{120}{1{,}024} - \frac{210}{1{,}024} = -\frac{90}{1{,}024}$) to the difference between the
points 6 and 7 (that is, 7 − 6 = 1); and the relative slope at the
mid-point is given by the ratio of the absolute slope to the
ordinate at that point. Thus, the relative slope of the polygon of
Fig. 99 at the mid-point 6.5 is

$$\left(-\frac{90}{1{,}024}\right) \Big/ \left(\frac{165}{1{,}024}\right) = -\frac{90}{165} = -0.545$$

In general, the relative slope of a symmetrical binomial
distribution at any mid-point $N_1 + \frac{1}{2}$ is equal⁴ to

$$-\frac{2(2N_1 + 1 - N)}{N + 1}$$

If x is set equal to $N_1 + \frac{1}{2} - \frac{N}{2}$, that is, the deviation of the
mid-point from the mean, and if $\sigma'^2$ is set equal to $\frac{N + 1}{4}$,
this expression for the relative slope at point x becomes
$-\frac{4x}{N + 1}$, or $-\frac{x}{\sigma'^2}$. But it can be shown⁵ by the differential
calculus that the relative slope of the normal curve at any point x
is $-\frac{x}{\sigma^2}$, where $x = X - \bar{X}$. Hence the relative slope of the
symmetrical binomial distribution at any mid-point is the same as
that of a normal curve whose standard deviation is equal to
$\sigma' = \frac{\sqrt{N + 1}}{2}$, which, if N is large, is practically the same as the
standard deviation of the symmetrical binomial distribution.⁶

³ Page 283.

⁴ See Smith and Duncan, Sampling Statistics.
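The equality of the two relative slopes can be verified for every side of the polygon. The sketch below is offered as an illustration only; the function names are introduced here, and the variance of the matching normal curve is taken to be (N + 1)/4, which is the value the algebra requires.

```python
from math import comb

def relative_slope(n, n1):
    """Relative slope of the binomial frequency polygon at the mid-point
    between n1 and n1 + 1, for n fair coins."""
    p_low = comb(n, n1) / 2**n
    p_high = comb(n, n1 + 1) / 2**n
    absolute_slope = (p_high - p_low) / 1    # abscissas differ by 1
    mid_ordinate = (p_high + p_low) / 2
    return absolute_slope / mid_ordinate

def normal_relative_slope(n, n1):
    """Relative slope -x / variance of a normal curve at the same mid-point,
    with variance (n + 1) / 4 (an assumption stated in the lead-in)."""
    x = n1 + 0.5 - n / 2                     # deviation of mid-point from mean
    return -x / ((n + 1) / 4)

print(round(relative_slope(10, 6), 3))
print(round(normal_relative_slope(10, 6), 3))
```

For N = 10 and the mid-point 6.5 both functions reproduce the value obtained from Fig. 99.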


CONDITIONS PRODUCING THE SYMMETRICAL BINOMIAL AND 
THE NORMAL CURVE IN REAL LIFE 

The foregoing sections have been devoted to the derivation and 
description of a particular frequency distribution known as the
binomial distribution. The analysis has consisted entirely of an 
application of the probability calculus, and the result is an 
abstract distribution of probability. Since the ultimate purpose 
of the analysis is an explanation of some of the frequency distri- 
butions of real life, it is desirable at this point to consider the 
question: What is the relationship between the symmetrical 
binomial distribution and a frequency distribution of real life? 

Consider first the following hypothetical experiment: Suppose 
that the 10 coins referred to in the theoretical discussion are 
tossed a large number of times and the relative frequencies with 
which they come up 0, 1, 2, ... , 10 heads are computed. 
What will be the results? Actually, no precise prediction can be 
made. Intuition suggests, however, that, if the coins are sym- 
metrical and are tossed in an unbiased fashion, the relative 
frequencies with which the combinations 0, 1, 2, . . . , 10 heads 
will appear will be close to the probabilities of these combinations 
among the whole set of combinations that could be made from 
10 coins. For if coins are tossed at random, it is to be expected 
that a head will appear on any one coin as often as a tail. The 
randomness also ensures that the appearance of a head on one 
coin will be independent of the appearance of a head or a tail on 
any other coin. Under these conditions it would seem likely
that any particular arrangement of heads and tails would occur
just as often as any other arrangement. Therefore, the relative
number of times 3 heads and 7 tails would appear, for example,
would be equal to the relative number of arrangements that
would yield 3 heads and 7 tails out of the set of all possible
arrangements. This is the relative frequency of the binomial
distribution. Intuition thus suggests the results of random
coin tossing will be closely approximated by the binomial
frequencies. Actual experiments lend support to this argument,
so that it would seem possible to predict the results of a large
number of tossings by the use of the probability calculus. This
is merely an application of the law of large numbers.

⁵ Ibid.

⁶ See Eqs. (2), p. 283.
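The hypothetical experiment can be imitated with a pseudo-random number generator standing in for the coins. The sketch below is an illustration under that assumption; the number of tossings (20,000) and the seed are arbitrary choices made here.

```python
import random
from math import comb

def simulate(tossings, coins=10, seed=0):
    """Relative frequencies of 0, 1, ..., coins heads in repeated tossings."""
    rng = random.Random(seed)
    counts = [0] * (coins + 1)
    for _ in range(tossings):
        # Each coin independently shows a head with probability 1/2.
        heads = sum(rng.random() < 0.5 for _ in range(coins))
        counts[heads] += 1
    return [c / tossings for c in counts]

theory = [comb(10, k) / 2**10 for k in range(11)]
observed = simulate(20_000)
for k, (t, o) in enumerate(zip(theory, observed)):
    print(f"{k:2d} heads   binomial {t:.4f}   simulated {o:.4f}")
```

Increasing the number of tossings generally brings the simulated column closer to the binomial column, though no single run can be predicted exactly.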

The relationship between the results of coin tossing and the 
binomial probabilities suggests even more important inferences. 
For there may be conditions in real life that are similar to those 
involved in the tossing of coins, and statistical variables produced 
by these conditions may be expected to follow the symmetrical 
binomial distribution and in special instances the normal curve. 
To illustrate the conditions that might give rise to such results 
consider the following examples: 

Example 1. Suppose that the sex of the offspring of a certain animal is
determined by the type of the egg cell in the female that unites with the
sperm cell of the male, and suppose that the number of egg cells in
the female that will produce male offspring is on the average equal to the
number of egg cells that will produce female offspring. If sperm cells unite
with egg cells in a random manner, the chance is 1/2 of an offspring being a
male and 1/2 of its being a female. These are essentially the same conditions
that determine whether a symmetrical coin should turn up heads or tails
when tossed at random. Under such conditions the frequency distribution
of the number of males in families of a given size should theoretically follow
the symmetrical binomial distribution. Thus families of size 5 should be
expected to vary in sex combination as follows:

Number       Percentage of Families Having
of Males     Specified Number of Males

   0               1/32 = 0.03
   1               5/32 = 0.16
   2              10/32 = 0.31
   3              10/32 = 0.31
   4               5/32 = 0.16
   5               1/32 = 0.03


A study of the sex of pigs in 116 litters of 5 pigs each showed the following:

Number       Percentage of Litters Having
of Males     Specified Number of Males

   0               0.02
   1               0.17
   2               0.35
   3               0.30
   4               0.12
   5               0.03


The closeness of these figures to those above suggests that the theory of 
sex determination outlined above might very well be valid for pigs. 
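The comparison in Example 1 can be reproduced in a few lines. This is only a sketch: the observed figures are copied from the litter table above, and the theoretical figures assume a chance of 1/2 for each sex.

```python
from math import comb

LITTER_SIZE = 5

# Symmetrical binomial probabilities for 0, 1, ..., 5 males.
theory = [comb(LITTER_SIZE, k) / 2**LITTER_SIZE for k in range(LITTER_SIZE + 1)]

# Relative frequencies reported for the 116 litters.
observed = [0.02, 0.17, 0.35, 0.30, 0.12, 0.03]

for males, (t, o) in enumerate(zip(theory, observed)):
    print(f"{males} males   theory {t:.2f}   observed {o:.2f}   difference {o - t:+.3f}")

largest = max(abs(o - t) for t, o in zip(theory, observed))
print(f"largest difference = {largest:.3f}")
```

Whether differences of this size are within the range of chance is a question for the test of goodness of fit discussed later in the chapter.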

Example 2. In Example 1, conditions were such as to produce a variable 
(number of males) that was discrete and integral. The present hypothetical 
example will suggest conditions which might produce a variable that was 
discrete but not integral and that was distributed in the form of a sym- 
metrical binomial. It also suggests conditions under which the variable 
might be practically continuous and distributed like a normal curve. 

Suppose that there are a large number of bags of flour, say 10,000, each
weighing exactly 5 pounds. Suppose that an experimenter opens each bag
in succession and adds or subtracts a certain quantity of flour to each bag
in accordance with the following rule: Whenever he opens a bag, he also
tosses 10 coins. For each head that appears he adds an ounce of flour to the
bag; for each tail, he subtracts an ounce. The result of this procedure will
be 10,000 bags of flour varying in weight from 5 pounds − 10 ounces to
5 pounds + 10 ounces, the unit difference being 2 ounces. In accordance
with the foregoing analysis, the distribution of the weights of these bags of
flour would be approximately as follows:


Weight of Bag      Relative Frequency of Occurrence

4 lb. 6 oz.               1/1,024
4 lb. 8 oz.              10/1,024
4 lb. 10 oz.             45/1,024
4 lb. 12 oz.            120/1,024
4 lb. 14 oz.            210/1,024
5 lb. 0 oz.             252/1,024
5 lb. 2 oz.             210/1,024
5 lb. 4 oz.             120/1,024
5 lb. 6 oz.              45/1,024
5 lb. 8 oz.              10/1,024
5 lb. 10 oz.              1/1,024


In other words, the distribution of weights would approximately conform
to a symmetrical binomial distribution with a mean weight of 5 pounds and a
standard deviation of $\sqrt{2.5} \times 2 \approx 3.16$ ounces.
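The mean and standard deviation of the bag weights follow directly from the rule of the experiment, since each head adds an ounce and each tail removes one. A short sketch, with weights expressed in ounces (5 pounds = 80 ounces) and names chosen here for illustration:

```python
from math import comb, sqrt

BASE_OUNCES = 5 * 16   # every bag starts at exactly 5 pounds
COINS = 10

# Final weight in ounces -> probability of that weight.
# heads ounces are added and (COINS - heads) ounces are subtracted.
distribution = {
    BASE_OUNCES + heads - (COINS - heads): comb(COINS, heads) / 2**COINS
    for heads in range(COINS + 1)
}

mean = sum(w * p for w, p in distribution.items())
sd = sqrt(sum((w - mean) ** 2 * p for w, p in distribution.items()))
print(f"lightest bag {min(distribution)} oz, heaviest bag {max(distribution)} oz")
print(f"mean = {mean:.2f} oz, standard deviation = {sd:.2f} oz")
```

The standard deviation is twice that of the number of heads, because each additional head changes the weight by 2 ounces.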

This shows how a variable may be produced that is discrete but not
integral and that is distributed in the form of a symmetrical binomial
distribution. To produce a variable that is practically continuous, it is
necessary to increase the number of coins from 10 to 100, say, and to reduce
the amount of flour added or subtracted to 0.01 ounce. Differences as
small as 0.02 ounce would thus be possible, and for all practical purposes
the weight of a bag of flour could be said to be continuous. Under these
conditions a graph of the distributions of weights would be practically
continuous and as indicated in the theoretical discussion would have the
form of a normal frequency curve.

Example 3. Example 2 was entirely 
hypothetical. An apparatus has been 
constructed, however (see Fig. 100), that 
reproduces in somewhat different form the 
conditions of Example 2. By its use the 
results predicted in Example 2 can be 
concretely illustrated. 

The apparatus of Fig. 100 was devised many years ago by Sir Francis
Galton and subsequently elaborated by Karl Pearson.¹ As represented in
Fig. 100 it consists of a series of rows of wedges, each row containing an
additional wedge and so arranged that its wedges come halfway between the
wedges of the row above. If the wedges are placed 1 centimeter apart, then
a small ball dropped into the top of the machine will have an equal chance
in each row of being deflected 0.5 centimeter either to the left or the right.
The apparatus of Fig. 100 has 10 rows. The final deviation of the ball from
the central point 0 will thus be the algebraic sum of the left (minus) and
right (plus) deflections as it falls through the 10 rows. The possible range
of this final deviation is from −5 to +5 centimeters. Since the probability
of a plus and minus deviation of 0.5 centimeter is in each row equal to 1/2
(similar to the probability of a head and a tail for a coin) and since there
are 10 rows (as there were 10 coins in the previous case), the probabilities of
final deviations of −5, −4, −3, −2, −1, 0, +1, +2, +3, +4, +5 centimeters
will be the same as those of the binomial distribution,

$$P(N_1) = \frac{10!}{N_1!\,(10 - N_1)!}\left(\frac{1}{2}\right)^{10}$$

which are given in Table 19, page 282.

Fig. 100. The Pearson-Galton apparatus for physical derivation of a
binomial distribution.

¹ Galton, Francis, Natural Inheritance (Macmillan & Company, Ltd.,
London, 1889), p. 63; Pearson, Karl, “Skew Variation in Homogeneous
Material,” Philosophical Transactions of the Royal Society of London, Series
A, Vol. 186 (1895), p. 343. Pearson's contribution was to replace the set of
nails used by Galton by a set of sliding wedges that could be so adjusted
that the chances of deflection to the left and right were not equal. Figure
100 follows the pattern of Galton's apparatus.

These are the theoretical probabilities of the apparatus. If a large
number of balls are actually dropped into the machine, the exact result
cannot be predicted. Intuition suggests, however, that the relative
frequencies with which the balls will pile up in the different slots will tend
to approximate the theoretical probabilities and this is demonstrated by
actual experiments. Such a result is pictured in Fig. 100 by the shading of
the slots in proportion to the binomial probabilities.

It will be noted that in this case the variable, that is, the final deviation 
of a ball from the central point 0 is again discrete. Deviations of integral 
centimeters only are possible. If, however, the number of rows were 
increased from 10 to 1,000, say, and if at the same time the wedges were 
reduced to 0.01 centimeter in size and placed so that they were only 0.01 
centimeter apart (the balls would, of course, have to be correspondingly 
reduced in size), then the final deviations would vary by 0.01 centimeter 
and might be practically considered a continuous variable. The distri-
bution of relative frequencies would in this case closely approximate a
smooth frequency curve, which would once again be the normal curve. 

Theory of Errors. Errors in physical measurements may be broken up
into several components. (1) The “instrumental error” may be attributed
to the particular instrument with which the measurement is made; every
measurement by it will contain a certain error that may be assigned to that
instrument. (2) The “personal error” may be attributed to the particular
person undertaking the measurement; every observation by him will be
influenced by his “personal equation.” (3) Another component error may
be attributed to particular external conditions, such as the temperature,
sunlight, and wind. These errors due to the instrument, the observer, and
specific external conditions are all “systematic errors” that can be allowed
for. (4) A final component error is the “accidental error,” or “residual
error,” to which no definite cause can be assigned. Such errors are the
result of the whole host of chance forces, the same sort of forces that
determine whether an unbiased coin comes up heads or tails. The total
accidental error in any individual measurement may be taken to be the sum
of a number of small accidental errors arising from different causes.¹ Slight
irregular changes in external conditions, such as the vibration on account of
air currents or irregular changes in the personal equation of the observer,
are examples of causes for accidental error of measurement. If it is
possible to discover the law of action of any error, it is thereby removed from
the class of accidental errors to the class of systematic errors.

If the number of forces affecting the residual errors in any series of
measurements is large, if each causes a very small plus or minus deviation
from the true value, and if the probability of a plus and a minus deviation
is for each force equal to 1/2, then, as in the case of the flour-bag experiment
and the Pearson-Galton apparatus, the final residual errors of the series of
measurements will tend to be distributed in accordance with the normal
curve.¹ The mean of this curve will be the true value (after allowance has
been made, of course, for the systematic errors mentioned above), and the
standard deviation of the curve will be an index of the precision of
measurement. This is the theory of errors.² It is supported by the close
agreement between the normal curve and distributions of actual
measurements. In fact, the normal curve is often spoken of as the “error
curve” or the “Gaussian error curve,” after the man who was among the
first to recognize the possibility of applying the theory of probability to the
investigation of the errors of measurement.³

¹ See Smith and Duncan, Sampling Statistics; also see Brunt, David,
The Combination of Observations, pp. 3-4.

Summary of Conditions Leading to the Symmetrical Binomial
Distribution and the Normal Curve. The foregoing examples
suggest that whenever the following conditions exist in real 
life, the data generated by these conditions will tend to be dis- 
tributed in the form of a symmetrical binomial distribution and, 
if certain other conditions are also present, in the form of a 
normal curve. The conditions giving rise to the symmetrical 
binomial distribution may be stated as follows: 

1. In the absence of certain “causes” of variation or in the
event of a perfect balancing of their effects, the data assume a
fixed central value. (The 5 pounds of the flour illustration, the
“true value” in a series of measurements.)

2. Deviations from this central value result from certain
“causes” of variation, the effect of any “cause” being either to
add a fixed quantity to the data or to subtract the same quantity.
(To add or subtract 1 ounce of flour or to add or subtract an
“error” of 0.5 centimeter.)

3. A “cause” of variation tends to produce positive effects and
negative effects in equal proportion, that is, P(+) = P(−) = 1/2.
(The probability of a head equals the probability of a tail; the
probability of a positive error equals the probability of a negative
error.)

4. The effects of all contributory causes of variation are of
equal magnitude. (Each adds or subtracts 1 ounce of flour
or 0.5 centimeter.)

5. The contributory causes are independent in their action.
In other words, the contribution of a positive or negative effect
by any causal factor is independent of the previous contributions
of other causal factors.

6. The total deviation of any element from its central value
is the algebraic sum of the positive and negative contributions of
the individual causal factors. (The total amount of flour added
or subtracted from a bag is the sum of the ounces added for each
head tossed minus the ounces subtracted for each tail tossed.)

² Actually this is a special case of a more general statement of the theory.
As pointed out in Smith and Duncan, Sampling Statistics, each force may
cause deviations of varying size with varying probabilities and the final
residual errors will still tend to be normally distributed provided that the
number of forces is very large and the relative importance of each is about
the same.

³ For Gauss's fundamental works see Abhandlungen zur Methode der
kleinsten Quadrate (A. Börsch and P. Simon, Berlin, 1887).

If in addition to these conditions, the following also exist, then 
the resulting distribution will tend to conform to the normal 
curve: 

7. The number of contributory causes is very large. (A 
large number of coins are tossed; the binomial machine contains 
a large number of rows.) 

8. The positive and negative contributions of each cause are 
very small. (If 0.01 ounce is added or subtracted instead of 1 
ounce; if 0.005 centimeter, instead of 0.5 centimeter.) 

It is to be noted that, so far as the normal curve is concerned,
not all these conditions are necessary for its generation. The
above conditions will produce it, but the normal curve may also
occur when some of these conditions are absent.¹ It may be
stated here that the normal curve will still be produced if
conditions 2 and 3 are relaxed so that a causal factor may affect the
data in varying degree and with varying probabilities and also if
condition 4 is only approximately and not exactly true.² The
most important conditions are 6 to 8 and condition 4 in an
approximate form. For example, in the case of the flour
illustration, the resulting weights of the bags of flour would still tend
to be normally distributed even if biased dice instead of unbiased
coins were used and if the amount of flour added or subtracted
varied with the result of the throw (say 0.001 ounce for the
occurrence of a one, −0.002 ounce for the occurrence of a two,
0.003 ounce for the occurrence of a three, −0.004 ounce for the
occurrence of a four, etc.) provided that the number of dice thrown
was very large and the amount added or subtracted per die was
very small and of about the same order of magnitude from die to
die. The normal curve is thus a more general phenomenon than
the symmetrical binomial distribution.³

¹ See Smith and Duncan, Sampling Statistics.

² Under certain conditions the requirement of independence (condition 5)
may also be relaxed. See Smith and Duncan, Sampling Statistics.
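The biased-dice illustration lends itself to a simulation. In the sketch below the bias of the dice (`WEIGHTS`), the amounts for a five and a six (continuing the "etc." of the text), the number of dice, and the number of bags are all assumptions made here for illustration. The measures $\beta_1$ and $\beta_2$ of the resulting totals are then compared with the normal values 0 and 3.

```python
import random

FACES = [1, 2, 3, 4, 5, 6]
WEIGHTS = [0.25, 0.10, 0.20, 0.15, 0.10, 0.20]   # hypothetical bias of each die
# Ounces added (+) or subtracted (-) for each face; faces 5 and 6 are assumed.
AMOUNTS = {1: 0.001, 2: -0.002, 3: 0.003, 4: -0.004, 5: 0.005, 6: -0.006}

def total_change(n_dice, rng):
    """Net ounces added to one bag when n_dice biased dice are thrown."""
    faces = rng.choices(FACES, weights=WEIGHTS, k=n_dice)
    return sum(AMOUNTS[f] for f in faces)

def betas(values):
    """Return (beta1, beta2) computed from the sample central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3**2 / m2**3, m4 / m2**2

rng = random.Random(1)
totals = [total_change(400, rng) for _ in range(4000)]
beta1, beta2 = betas(totals)
print(f"beta1 = {beta1:.4f}   beta2 = {beta2:.4f}")
```

Values of $\beta_1$ near 0 and $\beta_2$ near 3 are consistent with, though they do not prove, a normal form for the distribution of the totals.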

Examples of Normal Frequency Distributions. Natural forces 
appear to generate normal frequency distributions in many fields. 
Physical measurements have already been mentioned. Figure 
101 shows the distribution of heights of 300 eighteen-year-old 
Princeton freshmen. The grades of students on examinations, 
hourly earnings of workers, the length of life of electric- 
light bulbs, the distance of baseball throws of first-year high- 
school girls are all normally distributed variables. In these 
fields and in many others, it would seem that the conditions of 
variation are those which theoretically give rise to the normal 
curve. 



Fig. 101. — Normal curve fitted to heights of 300 Princeton freshmen. 

DETERMINATION OF NORMALITY 

Several procedures are available for determining whether the
population from which a given set of sample data has been taken
might reasonably be considered to conform to the normal curve.
In general, these consist of comparing the histogram constructed
from the sample data with a normal curve “fitted” to this
histogram. The difference in the various procedures lies in the bases
of comparison. Several of the more important procedures will
now be discussed.

³ Mathematically the normal curve can be derived from a great variety of
different assumptions. See, for example, Czuber, Emanuel, Theorie der
Beobachtungsfehler (B. G. Teubner, Leipzig, 1891).

Graphic Comparison. The simplest method of determining 
whether the assumption of normality is or is not reasonable is 
to graph the histogram and normal curve together and see how 
well the curve fits. The test here is purely a subjective one, 
but in many cases when the fit is exceptionally good or excep- 
tionally bad this is probably sufficient for acceptance or rejection 
of the hypothesis. 

In making a graphic comparison of a sample histogram and a
normal curve, it is necessary to determine what mean and what
standard deviation should be assigned to the curve. Offhand
the simplest procedure would appear to be the assignment to
the curve of the mean and standard deviation of the histogram,
for presumably these are the best estimates that may be made
of the mean and standard deviation of the population from which
the sample was taken.¹ It will be recalled, however, that in the
calculation of the mean and standard deviation of the histogram
the data were distributed among various classes or groups² and all
the cases in any class interval were assumed to be concentrated
at the mid-point of the interval. But the population is
presumably distributed in the form of a smooth curve, so that, in
estimating its mean and standard deviation from that of the
histogram, allowance must be made for the grouping of the data
in the construction of the histogram. In any interval a smooth
bell-shaped curve, such as the normal curve, will have more cases
that are on the side toward the mean than on the side away from
the mean. The assumption that all cases are concentrated at
the mid-point of an interval will not cause any appreciable error
in the mean calculated from grouped data, for plus and minus
deviations will offset each other; but it will cause the standard
deviation of the grouped data to be greater than the standard
deviation of the smooth curve that represents the true
distribution of the data. Some adjustment should therefore be made in
the standard deviation of the histogram before it is taken as an
estimate of the standard deviation of the population.

The adjustment that must be made for grouping has been
determined by W. F. Sheppard. He has shown that under
conditions that are true for a normal distribution the variance
of the smooth curve is approximately equal to the variance of
the grouped data minus one-twelfth the square of the class
interval.³ In other words, if $\mu_2$ (uncorrected) is the second moment
(= $\sigma^2$) of the grouped data about its mean, $\mu_2$ is the second
moment of the smooth curve about its mean, and i is the class
interval, then

$$\mu_2 = \mu_2\ (\text{uncorrected}) - \tfrac{1}{12}(i)^2$$

The quantity $\tfrac{1}{12}(i)^2$ is Sheppard's correction for grouping that is
required for estimating the standard deviation of the fitted
normal curve.

¹ ² Cf. pp. 318 and 319.
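The effect of grouping, and of the correction, can be seen by grouping a large artificial sample and comparing variances. The sketch below uses simulated normal data (mean 70, standard deviation 2.5, class interval 1) purely for illustration; none of these figures come from the Princeton data.

```python
import random

def variance(values):
    """Second moment about the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def sheppard_corrected_variance(grouped_variance, class_interval):
    """Variance of grouped data minus one-twelfth the squared class interval."""
    return grouped_variance - class_interval**2 / 12

rng = random.Random(7)
INTERVAL = 1.0
heights = [rng.gauss(70.0, 2.5) for _ in range(100_000)]   # hypothetical data

# Replace every height by the mid-point of the class interval containing it.
midpoints = [(h // INTERVAL) * INTERVAL + INTERVAL / 2 for h in heights]

raw = variance(heights)
grouped = variance(midpoints)
corrected = sheppard_corrected_variance(grouped, INTERVAL)
print(f"ungrouped {raw:.4f}   grouped {grouped:.4f}   corrected {corrected:.4f}")
```

In a run of this kind the grouped variance exceeds the ungrouped variance by roughly one-twelfth, and the corrected figure falls close to the ungrouped one.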

In fitting a normal curve to a sample histogram, therefore, the
mean of the curve is taken equal to the mean of the histogram
and the variance of the curve is taken equal to the variance of the
histogram minus $\tfrac{1}{12}(i)^2$. In plotting the curve a table of the
ordinates of the standard normal curve may conveniently be used.
If the histogram to which the curve is to be fitted is of the usual
type, that is, if it consists of a series of rectangles of which the
heights measure aggregate frequencies and if the intervals on
which these rectangles are erected are laid off in terms of original
X units, then ordinates of the standard normal curve can be
taken to represent the particular normal curve desired by making
certain simple adjustments. The ordinates of the standard
normal curve, it will be recalled, are given for values of x that
are measured from the mean of the distribution and are expressed
in terms of standard deviation units. It will also be recalled that
the area of the curve over any given interval measures the relative
frequency of cases falling in this interval. To make these
ordinates represent a normal curve with a given mean and a
given standard deviation, they need only be plotted so that the
ordinate for $x/\sigma = 0$ comes at the specified mean value and
ordinates for other values of $x/\sigma$ come at $X = \bar{X} \pm x$. To
put them on the same basis as the histogram, however, they must
also all be multiplied by $Ni/\sigma$. This is because the total area
of the histogram⁴ is $Ni$ and that of the standard normal curve is
1 (that is, 100 per cent), whereas the abscissa scale on which the
histogram is plotted is $\sigma$ times the abscissa scale of the standard
curve. This use of the ordinates of the standard normal curve
may be illustrated by fitting a normal curve to the heights of
300 Princeton freshmen.

³ Cf. Proceedings of the London Mathematical Society, Vol. 29, pp. 353-380.

⁴ The area of any one rectangle is $Fi$, and the total area is therefore
$\sum Fi = Ni$.

In Table 20 the mid-points of the various class intervals into
which the 300 heights were distributed are set down in column
(1). In column (2) the difference between these mid-points and
the mean of the distribution ($\bar{X} = 70.47$) is computed, and in
column (3) this is divided by the adjusted standard deviation.
The results are the various values of $x/\sigma$ that correspond to the
mid-points of the various class intervals. The ordinates of the
standard normal curve at these values of $x/\sigma$ are then computed
from Table VII (see Appendix, page 694) and entered in column
(4). Finally, in column (5), these standard ordinates are
multiplied by $\frac{Ni}{\sigma} = \frac{300(1)}{2.47} = 121.46$ to put them on a par with the
sample histogram.
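The steps of Table 20 can be sketched as a short computation. The mean, corrected standard deviation, number of cases, and class interval below are the figures quoted for the Princeton heights; the ordinate is computed from the formula of the normal curve rather than read from Table VII, so results may differ slightly in the last place from tabled values, which rest on $x/\sigma$ rounded to two decimals.

```python
from math import exp, pi, sqrt

MEAN = 70.47       # mean of the histogram
SIGMA = 2.47       # standard deviation after Sheppard's correction
N = 300            # total number of cases
INTERVAL = 1.0     # class interval i

def standard_ordinate(z):
    """Ordinate of the standard normal curve at z = x / sigma."""
    return exp(-z * z / 2) / sqrt(2 * pi)

def fitted_ordinate(midpoint):
    """Height of the fitted curve on the scale of the histogram."""
    z = (midpoint - MEAN) / SIGMA
    return standard_ordinate(z) * N * INTERVAL / SIGMA

scale = N * INTERVAL / SIGMA
print(f"multiplier N i / sigma = {scale:.2f}")
for k in range(16):
    midpoint = 62.5 + k
    print(f"{midpoint:5.1f}   {fitted_ordinate(midpoint):6.2f}")
```

Plotting these heights at the class mid-points and joining them with a smooth line gives the curve to be laid over the histogram.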

Test of Goodness of Fit. Another method of comparing a
sample histogram with a normal curve is to compare the
frequencies given by the two, interval by interval. Whereas the
previous method was primarily subjective in that a conclusion
had to be reached from a mere inspection of the two graphs,
comparison of the histogram and the curve, interval by interval,
yields a numerical criterion of “goodness of fit.” A procedure
that has found favor because it permits a comparison with chance
results is to take the difference between the absolute frequencies¹
given by the curve and by the histogram for each interval, square
these differences, divide each by the frequency of the curve for
that interval, and finally sum the results. The quantity so
obtained is $\sum \frac{(F - f)^2}{f}$, where F represents for each class interval
the frequency given by the histogram and f the frequency given
by the curve.
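The criterion just described reduces to a one-line summation. In the sketch below the function name and the two lists of frequencies are hypothetical, chosen only to show the arithmetic; they are not the Princeton figures.

```python
def chi_square_statistic(histogram_freqs, curve_freqs):
    """Sum over class intervals of (F - f)^2 / f, where F is the histogram
    frequency and f the frequency given by the fitted curve."""
    if len(histogram_freqs) != len(curve_freqs):
        raise ValueError("both lists must cover the same class intervals")
    return sum((F - f) ** 2 / f for F, f in zip(histogram_freqs, curve_freqs))

# Hypothetical absolute frequencies for six class intervals.
F = [18, 55, 92, 85, 38, 12]                # histogram
f = [15.0, 60.0, 90.0, 90.0, 35.0, 10.0]    # fitted curve

statistic = chi_square_statistic(F, f)
print(f"sum of (F - f)^2 / f = {statistic:.3f}")
```

Note that both lists are absolute frequencies; relative frequencies from the curve must first be multiplied by the number of cases.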

Sampling theory shows² that, if this quantity is calculated for
a large number (theoretically, an infinite number) of sample
histograms from the same normal population, then the distribution
of these various sample values of $\sum \frac{(F - f)^2}{f}$ will be adequately
represented by a probability curve known as the $\chi^2$ curve and
this can be used to determine the probability of a larger value of
$\sum \frac{(F - f)^2}{f}$ by chance. If the probability is a large one, then
the difference between the given sample histogram and the
normal curve, as measured by $\sum \frac{(F - f)^2}{f}$, may reasonably be
attributed to chance; the curve may be deemed a good fit, and
the population from which the sample was drawn may
tentatively be taken as normal. If the probability is very small,
however, say less than 0.05, then the difference between the
histogram and the curve is to be attributed to something else
than chance, presumably to the nonnormality of the population
from which the sample was drawn. In this case, the normal
curve is not deemed a good fit, and the hypothesis of a normal
population is rejected. Owing to its use of the $\chi^2$ curve, this
second method of comparison is called the “$\chi^2$ test of goodness
of fit.”

¹ For the curve, this means the relative frequencies times N, the total
number of cases in the sample.

² On the $\chi^2$ distribution see Smith and Duncan, Sampling Statistics.

Table 20. Calculation of the Ordinates of the Normal Curve
That Fits the Distribution of Heights of 300 Princeton Freshmen

  (1)        (2)         (3)          (4)               (5)
   X      X − X̄ = x      x/σ      Ordinate of     Col. (4) × Ni/σ
                                 standard curve

  62.5      -7.97       -3.22       0.00224            0.27
  63.5      -6.97       -2.82       0.00748            0.91
  64.5      -5.97       -2.42       0.02134            2.59
  65.5      -4.97       -2.01       0.05292            6.43
  66.5      -3.97       -1.59       0.11270           13.69
  67.5      -2.97       -1.19       0.19652           23.87
  68.5      -1.97       -0.80       0.28969           35.19
  69.5      -0.97       -0.39       0.36973           44.91
  70.5       0.03        0.01       0.39892           48.45
  71.5       1.03        0.42       0.36526           44.36
  72.5       2.03        0.82       0.28504           34.62
  73.5       3.03        1.22       0.18954           23.02
  74.5       4.03        1.63       0.10567           12.83
  75.5       5.03        2.04       0.04980            6.05
  76.5       6.03        2.44       0.02033            2.47
  77.5       7.03        2.84       0.00707            0.86

X̄ = 70.47        σ (corrected) = 2.47

The $\chi^2$ test may be illustrated, as in the previous case, by the
distribution of heights of 300 Princeton freshmen. The numerical
calculations involved are set forth in Table 21. In column (1)
are put the upper limits (not the mid-points in this case) of the
various class intervals of the histogram of heights, the first and
last intervals being considered to run to $-\infty$ and $+\infty$, respectively.
The deviations of these class-interval limits from the
mean of the distribution are calculated in column (2), and their
ratio to the “corrected” standard deviation is computed in
column (3). In column (4) are written the areas under the
standard normal curve from its lower limit ($-\infty$) to each of
these class-interval limits, now measured in standard-deviation
units. These areas are obtained from Table VI, page 693,
of the Appendix. The figure in column (4) already gives the
area for the first interval ($-\infty$ to 63), and the areas for the other
intervals can be computed by taking the difference between
the area under the curve up to one class limit and the area up
to the next higher class limit. These differences are written in
column (5). In order to avoid very small areas in the extreme
intervals (and hence a distortion of the test*), several groups at
each end are amalgamated so that the areas for the first and last
interval are at least equal to 5/N.

The new arrangement is indicated in columns (1') and (5').
Now it will be noted that the figures of column (5') represent
the relative frequencies given by the curve. To convert them
to aggregate frequencies that are comparable with the aggregate
frequencies of the histogram, it is necessary to multiply
them by N (= 300), the total number of cases. This is done
in column (6). In column (7) are written the histogram
frequencies for the intervals of column (1'), and in the remaining
columns the differences between the histogram and curve frequencies
are computed, squared, and divided by the curve frequencies.
The sum of the last column is the value of $\sum \frac{(F - f)^2}{f}$
desired. In the present instance this is found to be 3.867.


* See ibid.

found for each $(X - \bar{X})/\sigma$ in Table VI of the Appendix, p. 693.

To determine whether the value of $\sum \frac{(F - f)^2}{f} = 3.867$
represents a good fit or not, turn to Table 22. Here are given
certain critical values for the statistic $\sum \frac{(F - f)^2}{f}$. The n of
the first column is equal to the number of class intervals minus 3.*
The figures in the second column represent values of $\sum \frac{(F - f)^2}{f}$
for which there is a probability of 0.05 that an equal or greater
value would be obtained by mere chance. For example, in the
present instance, n = 11 - 3 = 8, and Table 22 shows that, if
the data were truly normal, sample values for $\sum \frac{(F - f)^2}{f}$ that
were equal to or greater than 15.51 would be obtained only
5 times out of 100 for such a value of n. Since the computed
value of $\sum \frac{(F - f)^2}{f} = 3.867$, the chances of an equal or greater

Table 22. — Critical Values for $\sum \frac{(F - f)^2}{f}$

          Values of $\sum \frac{(F - f)^2}{f}$ for Which the
   n      Probability of an Equal or Greater Value Is Just 0.05

   1       3.84
   2       5.99
   3       7.81
   4       9.49
   5      11.07
   6      12.59
   7      14.07
   8      15.51
   9      16.92
  10      18.31
  11      19.67
  12      21.03
  13      22.36
  14      23.68
  15      25.00
  16      26.30
  17      27.59
  18      28.87
  19      30.14
  20      31.41

Table III, Table of $\chi^2$, in R. A. Fisher, Statistical Methods for Research
Workers, Oliver & Boyd, Ltd., Edinburgh, by the kind permission of the author.

* See Smith and Duncan, Sampling Statistics, for an explanation of the
significance of n in this case.
value is much more than 0.05. Hence the curve is deemed a
good fit, and the distribution of heights may be said to be normal.
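The test just described may be sketched in Python as follows. The sketch is illustrative only: the function names are hypothetical, the critical values are those printed in Table 22, and the small set of frequencies is invented for the example because the Table 21 figures are not reproduced here. Only the final check uses numbers from the text (3.867 with 11 class intervals).

```python
# Critical values at the 0.05 level, copied from Table 22 (n = 1, ..., 20).
CRITICAL_05 = {
    1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49, 5: 11.07,
    6: 12.59, 7: 14.07, 8: 15.51, 9: 16.92, 10: 18.31,
    11: 19.67, 12: 21.03, 13: 22.36, 14: 23.68, 15: 25.00,
    16: 26.30, 17: 27.59, 18: 28.87, 19: 30.14, 20: 31.41,
}

def chi_square_statistic(observed, expected):
    """Sum of (F - f)^2 / f, with F the histogram and f the curve frequencies."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((F - f) ** 2 / f for F, f in zip(observed, expected))

def is_good_fit(statistic, n_classes):
    """True when the statistic is below the Table 22 value for n = classes - 3."""
    n = n_classes - 3
    return statistic < CRITICAL_05[n]

# Hypothetical frequencies, for illustration only (not the Table 21 data).
observed = [10, 20, 30, 25, 15]
expected = [12.0, 18.0, 32.0, 24.0, 14.0]
stat = chi_square_statistic(observed, expected)
print(round(stat, 3), is_good_fit(stat, len(observed)))

# The case in the text: 3.867 with 11 class intervals, so n = 8.
print(is_good_fit(3.867, 11))
```

The subtraction of 3 from the number of class intervals follows the rule stated for the first column of Table 22.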

Comparison of Special Statistics. Although the test just outlined
is very commonly used, it has certain weaknesses as a test
of normality. (1) It should be noted that the squaring of the
differences between the group frequencies removes any significance
that might be attributed to the signs of the differences.
For example, it might happen in a given case that all the histogram
frequencies to the left of the center were larger than the
normal curve frequencies and that all the histogram frequencies
to the right of the center were less than the normal curve frequencies,
indicating a well-marked positive skewness; nevertheless,
if the absolute values of these differences were all small, the
test might not indicate any departure from normality. (2)
The necessity of combining the extreme intervals into larger
groups causes a loss of information and reduces the number of
points of comparison.* For these reasons, other methods of testing
for normality have been proposed.

If a set of sample data actually has come from a normal popula- 
tion, it is to be expected that its skewness will be slight and its 
kurtosis close to the normal kurtosis of 3. It would also be 
expected that the ratio of its average deviation to its standard 
deviation would be somewhere in the neighborhood of the value of 
this ratio for the normal curve (i.e., 0.7979). The departure of 
the actual values of these sample statistics from the theoretical 
values for the normal curve can thus be used as a test for normality. 

For the 300 Princeton freshmen, $\beta_1$, $\beta_2$, and the ratio of average
deviation to standard deviation (indicated by the symbol a) had
the values† $\beta_1 = 0.023$, $\beta_2 = 3.021$, and a = 0.805. These are

* Its practical effect is to reduce the value of n to be used in the $\chi^2$ table.

† No account was taken of Sheppard's correction in computing these
values. The average deviation used in making this test was computed from
the mean by the formula

$$\text{A.D.} = \frac{i}{N}\left[\sum f\,|d'| + c\,(N_1 - N_2) + N_0\left(\tfrac{1}{4} + c^2\right)\right]$$

where $c = (\bar{X} - A)/i$, $N_1$ = number of cases in intervals below the arbitrary
origin, $N_2$ = number of cases in intervals above the arbitrary origin, and
$N_0$ = number of cases in interval containing arbitrary origin. (A must be
in the same interval as $\bar{X}$.) Cf. Geary, R. C., and E. S. Pearson, Tests of
Normality, p. 4.
all very close to the values 0, 3, and 0.7979 of a truly normal distribution.
Hence this last, as well as the other tests, suggests that
heights are normally distributed.
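For ungrouped data the three special statistics may be computed as in the sketch below. Several points are assumptions made for the illustration: the function name is hypothetical, $\beta_1$ is taken as $\mu_3^2/\mu_2^3$ and $\beta_2$ as $\mu_4/\mu_2^2$ (the usual Pearson definitions, which this passage does not restate), the moments are taken about the mean with divisor N, and no grouping or Sheppard's correction is involved. The normal value 0.7979 of the ratio a is $\sqrt{2/\pi}$.

```python
import math

def normality_statistics(values):
    """Return (beta1, beta2, a) for a list of ungrouped observations.

    Assumed definitions: beta1 = m3^2 / m2^3, beta2 = m4 / m2^2,
    a = average deviation from the mean / standard deviation.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance
    m3 = sum((v - mean) ** 3 for v in values) / n   # third moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth moment
    average_deviation = sum(abs(v - mean) for v in values) / n
    beta1 = m3 ** 2 / m2 ** 3
    beta2 = m4 / m2 ** 2
    a = average_deviation / math.sqrt(m2)
    return beta1, beta2, a

# Value of the ratio a for a truly normal distribution.
NORMAL_A = math.sqrt(2.0 / math.pi)

# A tiny made-up sample, used only to show the call.
beta1, beta2, a = normality_statistics([1, 2, 3, 4, 5])
print(beta1, round(beta2, 3), round(a, 4), round(NORMAL_A, 4))
```

A sample from a normal population would be expected to give $\beta_1$ near 0, $\beta_2$ near 3, and a near 0.7979; the five made-up values above are far too few to show this and serve only to exercise the function.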

Sometimes the sample values of $\beta_1$, $\beta_2$, and a are not so close to
the normal values as in the foregoing illustration. In such
instances, use may be made of tables published in Tests of Normality¹
by R. C. Geary and E. S. Pearson. These tables give, for
various-sized samples, the sample values of $\beta_1$, $\beta_2$, and a for
which the probability of a greater value is 0.05 and 0.01, respectively.
For $\beta_2$ and a they also give values of these statistics for
which the probability of a smaller value is 0.05 and 0.01, respectively.
If, in any given instance, the sample value of $\beta_1$, $\beta_2$, or a
falls outside the limits given for a probability of 0.05, say, then it
may be concluded that the population from which the sample was
drawn was not strictly normal. For the weights of the 300
Princeton freshmen, for example, the sample values of $\beta_1$ and $\beta_2$
were 0.378 and 4.606. Both these are beyond the 0.01 probability
point given by Geary and Pearson's tables for a sample of 300
(these were 0.329 and 3.79, respectively), and it may therefore be
concluded that the distribution of weights is definitely not normal.

1 Issued by the Biometrika Office, University College, London, and 
printed at the University Press, Cambridge, England. 



CHAPTER XII 


USE OF THE NORMAL FREQUENCY CURVE IN SAMPLING 

ANALYSIS 

The normal frequency curve has its greatest usefulness in the
theory of random sampling.¹ While the full exposition of the
theory of random sampling is beyond the scope of this book, some 
of the simpler aspects that relate to the use of the normal curve 
in sampling analysis are presented in the ensuing pages of this 
chapter. 

SAMPLING FROM A TWOFOLD POPULATION 

The Problem. An elementary problem in the theory of sampling
is concerned with sampling from a twofold population.
Consider the following problem: Suppose a large city is undergoing
a fiercely contested election. The Radicals on the one hand and 
the Conservatives on the other are contending hotly for the 
mayoralty, and everyone in the city takes a stand on one side or 
the other. The voting population of the city thus forms a group 
in which a certain percentage are Radicals and a complementary 
percentage are Conservatives. Prior to the election these per- 
centages will not be known. They may, however, be estimated 
by taking a random sample. The inferences that may be made 
from such a random sample constitute the statistical problem 
that will now be analyzed. 

Sampling Distribution. For the sake of argument suppose that 
some omniscient being knew how each individual in the city stood 
politically. Suppose that he noted their positions on slips of 
paper — one for each individual — and put the slips into a large 
urn. Suppose, further, that there are actually an equal number 
of Radicals and Conservatives. Let the omniscient being mix the 
slips of paper thoroughly and then draw out a sample of 100 slips.²
Let him note the division of opinion for this sample, put the slips
back, and thoroughly mix them again. Finally, let him repeat
this process many times, taking a sample of 100 each time, so that
he eventually accumulates a large number of sample percentage
divisions of opinion. Many, but by no means all, of these sample
percentages will be the actual population percentage of 50; the
others will be distributed above and below the 50 per cent level.
This will be the sampling distribution of the sample percentages.

¹ For more elaborate exposition than is contained in this chapter, see
Smith and Duncan, Sampling Statistics.

² Mundane methods of obtaining random samples are discussed in ibid.

It is one of the important conclusions of the probability theory,
based upon the analysis of the preceding chapters,* that the
outcome of this process of random sampling will be a set of samples
in which the relative frequency of samples in which the
division of opinion is 0 per cent Radical, 10 per cent Radical, 20
per cent Radical, 30 per cent Radical, . . . , 100 per cent Radical
will be approximately the same as the probabilities of a binomial
distribution in which $\mathbf{p}_1 = 0.50$ and $\mathbf{p}_2 = 0.50$ and N = 100.† In
other words, relative frequencies of the sample percentages may
be estimated a priori, by means of the probability calculus.
Furthermore, since the size of the sample is large (N = 100), the
calculation of the probabilities can be simplified by using the
normal curve as an approximation to the binomial distribution.‡
In this problem, the curve will have a mean of 50 per cent,
because the population is equally divided between Radicals and
Conservatives by hypothesis, and a standard deviation equal to
5 per cent.§ The normal curve, with a mean of 50 per cent
and a standard deviation of 5 per cent, thus gives approximately
the “sampling distribution” for sample percentages taken from
a population in which the division of opinion is exactly 50 per
cent; and this is the sampling distribution of sample percentages
conceived in the preceding paragraph.
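How close the normal curve comes to the binomial distribution in this problem can be checked directly. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the normal areas are obtained from the error function instead of a printed table, and a half-unit continuity correction (not mentioned in the passage above) is applied when the range 40 to 60 per cent is converted to the continuous scale.

```python
import math

def binomial_probability(k, n, p):
    """Probability of exactly k Radicals in a sample of n."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def normal_cdf(x, mean, sd):
    """Area under the normal curve from minus infinity up to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

n, p = 100, 0.5
sd = math.sqrt(p * (1.0 - p) / n)        # standard deviation of sample percentages

# Exact binomial probability of 40 to 60 Radicals inclusive.
exact = sum(binomial_probability(k, n, p) for k in range(40, 61))

# Normal approximation over the same range (continuity correction assumed).
approx = normal_cdf(0.605, p, sd) - normal_cdf(0.395, p, sd)

print(sd, round(exact, 4), round(approx, 4))
```

The two probabilities agree closely, which is the sense in which the normal curve with mean 50 per cent and standard deviation 5 per cent approximates the sampling distribution.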

* See also ibid.

† When the symbol for a sample statistic is in boldface type, it refers to
the corresponding population parameter; thus here $\mathbf{p}_1$ and $\mathbf{p}_2$ refer to the
population values for which $p_1$ and $p_2$ are corresponding sample statistics.

‡ See pp. 283-290.

§ When the variable is expressed as a percentage instead of as an absolute
deviation from an integral mean value, the formula for the standard deviation
is $\boldsymbol{\sigma}_{\text{per cent}} = \sqrt{(0.5)(0.5)/N}$. Cf. p. 283.

The foregoing result is not limited, however, to cases in which
the actual division of opinion in the entire population is exactly
fifty-fifty but may be shown to be valid for any division of opinion
in the population.¹ Thus if the percentage of Radicals in the
population is $\mathbf{p}_1$ and the percentage of Conservatives is $\mathbf{p}_2$ (where
$\mathbf{p}_1 + \mathbf{p}_2 = 1$) and samples of size N are drawn at random from this
population, with replacements as above, then the relative frequencies
of various sample percentages of Radical opinion will be
given approximately by the probabilities of a normal frequency
curve whose mean is $\mathbf{p}_1$ and whose standard deviation is
$\sqrt{\mathbf{p}_1 \mathbf{p}_2 / N}$.

This conclusion is of capital importance in making inferences
about a population from which a single random sample has been
drawn, as will now be demonstrated.

Statistical Inferences from Samples. Types of Inference. In a
real instance, no omniscient being is available to record everyone's
opinion. Prior to the actual election, the only practical
way of determining the division of opinion is to take a random
sample from the population. This may be done by stopping people
on the street, ringing doorbells, sending out letters, or the
like.² When the results of the sample poll are counted, they may
be used to draw inferences about the true division of opinion in the
population in three ways; that is to say, three types of inference
may be drawn. (1) A certain hypothesis regarding the true
division of opinion may be tested as to its reasonableness in the
light of the sample results and either rejected or accepted.
(2) So-called “confidence limits” may be set up for which it may
be said that there is a given probability that these limits include
the true value. (3) A best single estimate may be made of the
population percentage; this is called an “optimum estimate.”
Each of these three types of inference will now be studied.

Testing a Hypothesis as to the Population Percentage. Let the
hypothesis be set up that the population is evenly divided
between Radicals and Conservatives. Suppose the sample poll of
100 voters shows 57 Radicals and 43 Conservatives. Although
the sample shows a percentage in favor of the Radicals, it is
possible, of course, that it may be misleading. Almost any result
might be yielded by a single sample, whatever the population.
If the population consisted even of 999,900 Conservatives and 100
Radicals, it would still be possible for a random sample of 100 to
consist of all Radicals. Such a result would not be very probable,
however, and the reasonableness of any hypothesis must be
judged by the probability of the sample result on the assumption
that the hypothesis is valid.

¹ For proof of this, see Smith and Duncan, Sampling Statistics.

² For further discussion see ibid.

The general procedure for testing the hypothesis is as follows:
First, the risk that is to be allowed in rejecting a given hypothesis
when it is in fact true must be decided upon. The “coefficient of
risk,” as it is called, is commonly, but not necessarily, set at 0.05.
In other words, it is the common practice to run the risk of rejecting
a hypothesis 5 times out of 100 when it is in fact true. When a
sampling distribution is normal, this is often done by saying that a
given hypothesis will be rejected if the sample result falls beyond
$\pm 2\boldsymbol{\sigma}$ from the mean value given by the hypothesis.¹ In the
present instance, the hypothesis that the true division of opinion
is fifty-fifty suggests that random samples of 100 taken from such
a population will have a mean percentage of Radical votes equal to
$\mathbf{p}_1$ = 50 per cent and a standard deviation of sample percentages
equal to $\sqrt{\mathbf{p}_1 \mathbf{p}_2 / N} = \sqrt{(0.5)(0.5)/100}$ = 5 per cent.

Accordingly, 95 per cent of the sample percentages would fall
between 50 per cent ± 2 × 5 per cent, or between 40 and 60 per
cent; 5 per cent of sample percentages would fall below 40 or
above 60 per cent. Hence, if this hypothesis is rejected when a
single sample return yields a percentage of Radical vote below
40 or above 60 per cent, then the hypothesis would in many
sample polls be rejected only 5 per cent of the time when it was
actually true. In other words, the rejector would be wrong only
1 out of 20 times in a large number of tries.

For the given problem, suppose the coefficient of risk is put at
0.05. Since the sample return is 57 Radical votes out of a total
of 100, the hypothesis of an equal division of opinion is not to be
rejected, for the sample result does not fall in the region of rejection
below 40 or above 60 per cent. In this instance, the sample
result does not deviate sufficiently from the hypothetical percentage
to cause its rejection. If the sample return had been
62 Radicals and 38 Conservatives, however, the hypothesis of
an equal division of opinion would have been rejected and it would
have been concluded that the Radicals were in the majority.
This argument and these conclusions are illustrated graphically
in Fig. 102.

From the figure it is seen that with a sample result of 57 per
cent the hypothesis that $\mathbf{p}_1 = 0.5$ is accepted while with a sample
result of 62 per cent the hypothesis that $\mathbf{p}_1 = 0.5$ is rejected.

¹ The desirability in some cases of using regions of rejection that fall all
above or all below the mean is discussed in ibid.
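The rule of rejecting the hypothesis when the sample percentage falls more than two standard deviations from the hypothetical percentage can be sketched as follows. The function name `check_hypothesis` and its parameters are hypothetical, chosen only for this illustration; the multiple of 2 corresponds to the coefficient of risk of about 0.05 used in the text.

```python
import math

def check_hypothesis(sample_count, n, hypothetical_p, multiple=2.0):
    """Return (sample proportion, lower bound, upper bound, rejected).

    The standard deviation is computed from the hypothetical population
    proportion, as in the text.
    """
    sample_p = sample_count / n
    sd = math.sqrt(hypothetical_p * (1.0 - hypothetical_p) / n)
    lower = hypothetical_p - multiple * sd
    upper = hypothetical_p + multiple * sd
    rejected = sample_p < lower or sample_p > upper
    return sample_p, lower, upper, rejected

# The two sample polls discussed in the text, each of 100 voters.
for radicals in (57, 62):
    sample_p, lower, upper, rejected = check_hypothesis(radicals, 100, 0.5)
    verdict = "rejected" if rejected else "not rejected"
    print(f"{radicals} Radicals: region of acceptance "
          f"{lower:.2f} to {upper:.2f}, hypothesis {verdict}")
```

With 57 Radicals the hypothesis of an even division stands; with 62 it is rejected, in agreement with the discussion of Fig. 102.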

Determining Confidence Limits for Population Percentage.
Before confidence limits can be established for a population
percentage it is first necessary to decide upon the degree of confidence
that is to be placed in the computed limits. This is
usually determined by so choosing the limits that the probability
of their including the true percentage equals an agreed-upon
figure, called the confidence coefficient. For example, if the
confidence coefficient is set at 0.95, as is the common practice,
then the limits will be so chosen that the probability of their
embracing the true value is just 0.95.

Fig. 102. — Sampling distribution of sample percentages of 100 votes.

In the case of a normal sampling distribution, confidence
limits with a confidence coefficient of 0.95 may be set up as follows:
Choose as the upper confidence limit a value for the population
percentage that, if it were the true value, would make the
probability of getting the given sample value or a lower sample
value just equal to 0.025. Since the sampling distribution is
normal, this upper limit may be obtained by choosing $\mathbf{p}_1$ so that
the sample value of 57 per cent falls at $-2\boldsymbol{\sigma}$ from the mean value
of the sample percentage, i.e., at $-2\boldsymbol{\sigma}$ from $\mathbf{p}_1$. The mathematical
equation becomes

$$0.57 - \mathbf{p}_1 = -2\sqrt{\frac{\mathbf{p}_1 \mathbf{p}_2}{N}}$$

or, since $\mathbf{p}_2 = 1 - \mathbf{p}_1$,

$$0.57 - \mathbf{p}_1 = -2\sqrt{\frac{\mathbf{p}_1 (1 - \mathbf{p}_1)}{N}}$$

When solved for $\mathbf{p}_1$, this becomes

$$\mathbf{p}_1 = \frac{0.57 + \dfrac{2}{N} + 2\sqrt{\dfrac{(0.57)(0.43)}{N} + \dfrac{1}{N^2}}}{1 + \dfrac{4}{N}}$$

When N is large, as it must be if the normal distribution is to be
used as an approximation to the binomial distribution, the terms
$2/N$, $4/N$, and $1/N^2$ can be dropped from the above equation
without materially affecting the result. In this approximate
form it becomes

$$\mathbf{p}_1 = 0.57 + 2\sqrt{\frac{(0.57)(0.43)}{100}} = 0.67$$

In effect, this indicates that the upper confidence limit can be
found approximately by adding to the sample percentage twice
the standard deviation of the sampling distribution, computed
with the sample percentage in place of the hypothetical population
percentage. In general, if $p_1$ is taken as the sample percentage
(note that sample statistics are printed in text type and
the corresponding population parameters in boldface type), the
upper confidence limit of the population percentage is given by

$$\mathbf{p}_1 = p_1 + 2\sqrt{\frac{p_1 p_2}{N}} \qquad (1)$$

This is shown graphically in Fig. 103a.

In a similar manner, the lower confidence limit is given approximately
by the formula

$$\mathbf{p}_1 = p_1 - 2\sqrt{\frac{p_1 p_2}{N}} \qquad (2)$$

For the given instance, in which $p_1 = 0.57$, this lower limit is

$$\mathbf{p}_1 = 0.57 - 2\sqrt{\frac{(0.57)(0.43)}{100}} = 0.47$$

This is shown graphically in Fig. 103b.

How the upper limit is determined, how the lower limit is
determined, and the resulting range or total interval between the
confidence limits are pictured graphically in Figs. 103a, b, and c.

The limits of the range are 0.47 to 0.67. This is known technically
as the “confidence interval” and is shown in Fig. 103c.
Owing to the manner in which the confidence limits were derived,
it may be said that there is a probability of 0.95 that this confidence
interval includes the true population percentage. By this
is meant that, if confidence intervals were set up like this from
many samples, 95 per cent of them would include the true
population percentage.
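The confidence limits for the poll of 57 Radicals out of 100 may be computed as in the sketch below. The function names are hypothetical. `approximate_limits` applies the large-sample rule of twice the standard deviation on either side of the sample percentage; `exact_upper_limit` solves the unapproximated equation for the upper limit, keeping the small terms in 1/N, so that the effect of dropping them can be seen.

```python
import math

def approximate_limits(sample_p, n, multiple=2.0):
    """Lower and upper limits: sample_p -/+ multiple * sqrt(p(1 - p)/n)."""
    sd = math.sqrt(sample_p * (1.0 - sample_p) / n)
    return sample_p - multiple * sd, sample_p + multiple * sd

def exact_upper_limit(sample_p, n):
    """Root P of sample_p - P = -2 * sqrt(P(1 - P)/n), small terms retained."""
    numerator = (sample_p + 2.0 / n
                 + 2.0 * math.sqrt(sample_p * (1.0 - sample_p) / n + 1.0 / n ** 2))
    return numerator / (1.0 + 4.0 / n)

lower, upper = approximate_limits(0.57, 100)
exact_upper = exact_upper_limit(0.57, 100)
print(round(lower, 2), round(upper, 2), round(exact_upper, 3))
```

The approximate limits round to 0.47 and 0.67, as in the text; the exact upper limit differs from the approximate one only in the third decimal place for a sample of this size.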

An Optimum Estimate of the Population Percentage. Up to
this point in the argument, a particular hypothesis regarding the
population has been tested and a method of setting up confidence
intervals has been devised. A final problem of statistical
inference is to indicate a method of making a single best estimate
of the population percentage from the given sample. Various
methods are employed, but the one that has received considerable
prominence in recent years and that will be employed here
is the method of maximum likelihood.

Fig. 104. — Diagram showing relationship between probability of sample and
likelihood of population percentage.

When a population percentage is given, the probabilities of
various sample results may be determined from the sampling
distribution of sample percentages, in this case, approximately
from the normal frequency curve. The analysis here runs from
a given population percentage to probabilities of various sample
results. When a particular sample result is given, however, it
is possible to determine the probabilities of obtaining this sample
result from various hypothetical values for the population percentage.
Here the analysis runs from a given sample percentage
to the probabilities of obtaining the particular sample from
various hypothetical population percentages. In the latter
analysis, the logarithm of the probability of the given sample
result for a particular value of the population percentage is
called the likelihood of the population percentage.

As shown in Figs. 104a to 104c, these likelihoods will vary for
different hypothetical values of the population percentage. The
value of the population percentage that has the maximum likelihood
is considered the best, or optimum, estimate of the population
percentage; this is shown in Fig. 104d. Figures 104a to
104c show graphically how the likelihoods of various population
percentages (or, more exactly, their antilogs) vary with changes
in the hypothetical values for these percentages. These various
results are summarized in Fig. 104d, which, if completed for a
large number of hypothetical values of the population percentage,
would become a smooth curve showing the variation
in the antilogs of the likelihoods of $\mathbf{p}_1$ with changes in $\mathbf{p}_1$. It is
to be noted that the maximum point of this curve is also the
point of maximum likelihood, since a logarithm is a maximum
when its antilog is a maximum.
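The search for the point of maximum likelihood can be imitated numerically. The sketch below is an illustration only: it uses the binomial probability of the sample instead of the normal approximation, omits the binomial coefficient (which is the same for every hypothetical percentage and so does not move the maximum), and tries hypothetical population percentages of 1, 2, . . . , 99 per cent. The names `log_likelihood` and `candidates` are hypothetical.

```python
import math

def log_likelihood(p, successes, n):
    """Logarithm of the probability of the sample, up to a constant term."""
    return successes * math.log(p) + (n - successes) * math.log(1.0 - p)

# Hypothetical values of the population percentage: 0.01, 0.02, ..., 0.99.
candidates = [k / 100 for k in range(1, 100)]

# The sample of the text: 57 Radicals in a poll of 100.
best = max(candidates, key=lambda p: log_likelihood(p, 57, 100))
print(best)
```

Among the hypothetical values tried, the likelihood is greatest at the value equal to the sample percentage itself.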

Without undertaking the mathematical analysis involved,¹
it may be pointed out that the value of $\mathbf{p}_1$ which has the maximum
likelihood is the value for which $\mathbf{p}_1 = p_1$. In other words,
the maximum likelihood estimate of a population percentage is the
percentage yielded by a given sample. This then becomes the
best estimate of the population figure; that is to say, the sample
percentage is the optimum, or best, estimate of the population
percentage.

¹ For such analysis, see ibid.


SAMPLING OF MEANS AND VARIANCES

Sampling Distribution of Means and Variances. The Mean.
Most of the preceding analysis applying to sample percentages
applies equally well to means of samples from a continuously
distributed population. If the population is normal in form, it
can readily be demonstrated that means of samples from such a
population will form a frequency distribution which is also normal
in form, the mean of which is the mean of the population and the
variance of which is the variance of the population divided by the
size of the sample.

If the population is not normal, the sampling distribution of
sample means nevertheless tends to be normal, with a mean
equal to the mean of the population and a variance equal to the
variance of the population divided by the size of the sample.¹

Accordingly, the equation for the standard deviation of the
sampling distribution of sample means is as follows:

$$\boldsymbol{\sigma}_{\bar{X}} = \frac{\boldsymbol{\sigma}}{\sqrt{N}} \qquad (3)$$

This is conventionally called the “standard error” of the mean.²

The Variance. If samples are taken from a normal population,
the sampling distribution of sample variances is not normal for
small samples but approaches the normal form as the samples
become larger, say larger than 30 cases. The mean of this
normal distribution is the variance of the population, and the
variance of the sampling distribution is the square of the variance
of the population multiplied by 2/N.

It is to be noted that, if the population is not normal, the 
sampling distribution of sample variances may not become 
normal, even for relatively large samples. Hence the use of 
the normal curve for making inferences about a population 
variance when the population is not normal may be an unwise 
procedure, even when the sample is large. 

But for variances of large samples taken from normal populations,
the standard error of the variance is given by

$$\boldsymbol{\sigma}_{\sigma^2} = \boldsymbol{\sigma}^2 \sqrt{\frac{2}{N}} \qquad (4)$$

¹ Ibid.

² Standard errors are printed in boldface type because they represent the
standard deviations of the populations of all possible sample statistics of
the type in question. Thus $\boldsymbol{\sigma}_{\bar{X}}$ is the standard deviation of all possible
sample $\bar{X}$'s.

The Standard Deviation. For standard deviations of large
samples taken from normal populations, the standard error of
the standard deviation is given by

$$\boldsymbol{\sigma}_{\sigma} = \frac{\boldsymbol{\sigma}}{\sqrt{2N}} \qquad (5)$$


Inferences about Population Means and Variances. Since 
the sampling distribution of sample means tends to be normal 
in form and the same is true of the sampling distribution of 
variances and standard deviations, if the population is normal, 
it follows that the normal curve can be used to make inferences 
about the population values of these parameters from correspond- 
ing sample statistics. 

Testing a Hypothesis about the Population Mean. To illustrate
how a hypothesis about a population mean may be tested, con- 
sider the following example. Suppose it is claimed that the 
mean length of life of a certain make of shoe (with constant wear) 
is 11.5 months. A random sample of 100 shoes is tested, and 
it is found that the average length of life of this sample is 10.8 
months. The standard deviation of the sample is 1.2 months. 
Do these sample results warrant the rejection of the claim of a 
true mean value of 11.5 months? 

To answer this question, proceed as follows: Let the risk of 
rejecting a hypothesis when it is true be set at 0.05. Then cal- 
culate the standard deviation of the sampling distribution of the 
mean (the “standard error” of the mean, as it is called) from
Eq. (3). Since the standard deviation of the population is not
known in this instance, the standard deviation of the sample
must be used in its stead.¹

The value of $\boldsymbol{\sigma}_{\bar{X}}$ for the given problem is accordingly

$$\boldsymbol{\sigma}_{\bar{X}} = \frac{1.2}{\sqrt{100}} = 0.12 \text{ month}$$

Next, calculate the difference between the hypothetical value
of the mean and the sample value of the mean. This is

$$10.8 - 11.5 = -0.7 \text{ month}$$

Finally, compare the size of this difference with the standard error
of the mean. If the difference is more than twice the standard error,
the hypothesis will not be accepted. In the present instance 0.7
is over five times greater than 0.12; so the claim that the true
mean is 11.5 is rejected. The sample mean deviates too greatly
from the hypothetical mean for the latter to be accepted as
reasonable.

¹ This substitution does not materially affect the analysis when the
sample is large. For further discussion, see ibid.

Confidence Limits for the Population Mean. Confidence limits
for the true mean with a confidence coefficient of 0.95 will be
obtained by laying off $2\boldsymbol{\sigma}_{\bar{X}}$ plus and minus from the sample
value. Thus, in the present problem these limits will be
10.8 ± 2(0.12) = 11.04 and 10.56. Accordingly it can be said 
that there is a probability of 0.95 that the interval from 10.56 to 
11.04 includes the true population mean within its range. 
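The test of the claimed mean and the confidence limits for the shoe example can be reproduced in a few lines. The sketch is illustrative; the function name `standard_error_of_mean` is hypothetical, and the sample standard deviation is used in place of the unknown population value, as the text does for large samples.

```python
import math

def standard_error_of_mean(sd, n):
    """Eq. (3): standard deviation divided by the square root of N."""
    return sd / math.sqrt(n)

sample_mean, sample_sd, n = 10.8, 1.2, 100
claimed_mean = 11.5

se = standard_error_of_mean(sample_sd, n)        # 0.12 month
difference = sample_mean - claimed_mean          # -0.7 month
ratio = abs(difference) / se                     # between 5 and 6
rejected = ratio > 2.0                           # twice the standard error

# Confidence limits with a confidence coefficient of about 0.95.
limits = (sample_mean - 2.0 * se, sample_mean + 2.0 * se)

print(round(se, 2), round(ratio, 1), rejected)
print(round(limits[0], 2), round(limits[1], 2))
```

The difference is well over twice the standard error, so the claim of 11.5 months is rejected, and the limits come out at 10.56 and 11.04 months.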

Optimum Estimate of the Population Mean. If the method of
maximum likelihood is used to give the best estimate of the 
population mean, it is found that the sample mean is the maxi- 
mum likelihood estimate of the population mean. Hence, in 
the present instance the best estimate of the population mean 
is 10.8 months. 

Testing a Hypothesis about the Population Variance. The
same analysis can be applied to inferences regarding population 
variances from sample variances. Suppose it is claimed that 
the true variability in the life of the given make of shoes is 
1.0 month. As in the case of the mean, this hypothesis may be 
tested by comparing the hypothetical value with the standard 
deviation of the sample of 100 shoes, which it will be assumed is 
1.2 months. 

The variances, or squares of the standard deviations, are 1.0
and 1.44 square months, respectively. Their difference is
1.44 - 1.00 = 0.44 square month. The standard deviation of
the sampling distribution of sample variances, i.e., the standard
error of the sample variance, is

$$\boldsymbol{\sigma}_{\sigma^2} = 1.00 \sqrt{\frac{2}{100}} = 0.141$$

Since the difference between the hypothetical value and the
sample value is more than three times (0.44/0.14 = 3+) the
standard error of the sample variance, the hypothesis must again
be rejected.¹

¹ For more exact methods especially applicable to small samples, see ibid.

If it were desired to test a hypothesis about the standard deviation
rather than about the variance, Eq. (5) would be used. In
the present instance, the population standard deviation is hypothetically
set at 1.0 month and the standard deviation in the
sample is 1.2 months; the difference is 0.2 month. Using Eq. (5),
the standard error of the standard deviation in this problem is
found to be

$$\boldsymbol{\sigma}_{\sigma} = \frac{1.00}{\sqrt{200}} = 0.07$$

Since the difference is almost three times the standard error, the
hypothesis is rejected as unreasonable.
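The two tests on variability can be checked in the same way. In the sketch below the function names are hypothetical; the standard errors follow Eqs. (4) and (5) and are computed, as in the text, from the hypothetical population values (variance 1.00, standard deviation 1.0) with N = 100.

```python
import math

def standard_error_of_variance(variance, n):
    """Eq. (4): variance times the square root of 2/N."""
    return variance * math.sqrt(2.0 / n)

def standard_error_of_sd(sd, n):
    """Eq. (5): standard deviation divided by the square root of 2N."""
    return sd / math.sqrt(2.0 * n)

n = 100

# Test on the variance: hypothetical 1.00, sample 1.44 square months.
se_var = standard_error_of_variance(1.00, n)
ratio_var = (1.44 - 1.00) / se_var

# Test on the standard deviation: hypothetical 1.0, sample 1.2 months.
se_sd = standard_error_of_sd(1.0, n)
ratio_sd = (1.2 - 1.0) / se_sd

print(round(se_var, 3), round(ratio_var, 2))
print(round(se_sd, 2), round(ratio_sd, 2))
```

Both ratios exceed 2, the first being a little over three and the second a little under three, so the hypothesis is rejected on either form of the test.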

Confidence Limits for the Population Variance. Confidence
limits for the population variance with a confidence coefficient of
0.95 are given by

$$\boldsymbol{\sigma}^2 = \sigma^2 \pm 2\boldsymbol{\sigma}_{\sigma^2} = 1.44 \pm 2(0.14) = 1.72 \text{ and } 1.16$$

It can thus be said that there is a probability of 0.95 that the
interval from 1.16 to 1.72 includes the true variance. The corresponding
interval for the population standard deviation is from
1.06 to 1.34, obtained by making use of Eq. (5).

Optimum Estimate of the Population Variance. Finally, the
maximum likelihood estimate of the population variance is (for
large samples) approximately the variance of the sample.¹
Hence the best estimate of the population variance in this instance 
is 1.44, which gives a population standard deviation of 1.2. 

CONCLUSION 

From the few illustrations in this chapter, it should be clear 
that the normal curve is very useful in making inferences about 
populations from random samples. It can be used to measure 
sampling fluctuations in sample percentages, sample means, and, 
in certain instances, sample variances, as well as in a number of
other statistics. It also has many uses in more advanced sampling
analyses and is probably the most important sampling
distribution that occurs in statistical theory.

¹ For small samples the multiplier $\frac{N}{N - 1}$ should be applied to the sample
variance to give a better estimate of the population variance. Thus the
optimum estimate, if N is small, say less than 30, is as follows:

$$\boldsymbol{\sigma}^2 = \frac{N}{N - 1}\,\sigma^2$$

Cf. ibid.

Table 23 contains not only the standard errors discussed in 
this chapter but also the standard errors for a number of other 
statistics. The method of applying these formulas to test 
hypotheses, to set up confidence intervals, or to obtain optimum 
estimates is similar for all statistics obtained from large samples. 

Table 23. Sampling Errors in Elementary Statistics for Which the 
Sampling Distribution Approximates the Normal Curve 
(Ordinarily these formulas for standard error cannot be used for N < 30) 


Statistics 

Standard errors 

jp 

il 



<r 

d, 

y/2N 

q. VFl(fft+3) 

2(5/32 - 6/3i - 9) 

^ 1.225 * 
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Mi 
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A.D.Mi 
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CO 

1 

" 

■o' 

Qi 

<2. 
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dQi ™ dQi 1.36263 ■ j 

VN 
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Vn 
* Cf. Waugh, Albert E., Elements of Statistical Method (1938), pp. 142-147. 



PART IV. 


Study of Bivariates and Multivariates 

CHAPTER XIII 
SIMPLE CORRELATION 

CORRELATION FUNDAMENTAL TO KNOWLEDGE 

Progressive development in the methods of science and philos- 
ophy has been characterized by increase in the knowledge of 
relationships, or correlations. Nature has been found to be a 
multiplicity of interrelated forces. The phenomena of the 
physical world outside man seem to be well adapted to this 
concept of interrelationship. The same is true with respect 
to phenomena having to do with human beings and their 
environment. 

Progress in the Discovery of Correlation. In the physical 
sciences, where the laws of nature are, within certain limits, 
determinate, experimental method has sufficed to disclose innu- 
merable relationships. Many of these physical correlations have 
become definitely known as cause and effect relationships. To 
some degree, too, this is true of biology, anthropology, geology, 
and the like. In these fields of study, great progress was made 
possible by the use of observation of “cases,” by tracing cor- 
relations previously known or suspected, and by laboratory 
experiments. In the social sciences, however, the establishment 
of certain knowledge, or knowledge of a high degree of probability 
regarding relationships, is a more difficult problem; and little 
scientific progress, comparatively speaking, has been made 
through the speculative method. This is particularly true so far 
as cause and effect relationships are concerned. 

For example, philosophical speculation, based upon qualitative 
or semiquantitative observation of experience seemed to many 
economists of the eighteenth, nineteenth, and twentieth centuries 
to have codified the relationship between money and credit on the 
one hand and prices and many social problems on the other hand. 
But no such certainty among these social scientists now exists as 
to the nature of the cause and effect order of events. In its 
earlier conception, the principle of the quantity theory of money 
seemed to be one of extraordinary simplicity and determinate- 
ness; but the more it is studied in its quantitative aspects the 
more complicated it is found to be in reality. By the 1930's 
and 1940's, the world of scientific monetary theorists came to be 
characterized by confused controversy. The practical world still 
awaits their solution of the theoretical problem in order to make 
possible a world-wide solution of the problem of monetary reform. 
Some say that increases or decreases in the quantity of money 
cause rising and falling prices, respectively; but others, with con- 
vincing argument, maintain that rising prices cause an increase in 
the quantity of money, and vice versa. It is a moot question as 
to whether or not statistics can come to the rescue in the matter 
of deciding the direction of the cause and effect relationship; 
but at least the technique has been developed to disclose the 
facts of relationship more precisely than was ever before 
possible. 

By the latter half of the nineteenth century, in many fields of 
study, a point had been reached where speculation concerning 
relationships could advance no farther with the existing tech- 
niques. More exact measurement of relationship was needed. 
Many questions in biology, anthropology, and the social sciences 
generally awaited a scientific answer to the question: How can 
relationship be measured? Two interesting attempts were made 
by American scholars to devise a method of measuring relation- 
ship, one in 1877 and the other in 1892.¹ Credit for the discovery 
of a method, and for its subsequent mathematical development, 
however, belongs largely to the scholars of England. 

Origin and Development of the Measurement of Correlation. In 
the nineteenth century pre-Darwinian and Darwinian doctrines of 
evolution were taking root, and the question of the influence of 
heredity vs. environment upon human characteristics was in a 
state of rarefied speculation and controversy. The experimental 
data appeared chaotic and amenable to as many interpretations 
as there were interpreters. 

¹ Bowditch, H. P., “The Growth of Children,” Eighth Annual Report 
of the State Board of Health of Massachusetts (1877), pp. 276-324; Bryan, 
W. L., “On the Development of Voluntary Motor Ability,” American 
Journal of Psychology, Vol. 5 (1892), pp. 128-204. These are both described 
in H. M. Walker, Studies in the History of Statistical Method (1929), pp. 

One of the great nineteenth-century students of the problem of 
heredity was Sir Francis Galton. He had been profoundly 
impressed by Darwin’s Origin of Species (1859), concerning which 
he said,¹ “Its effect was to demolish a multitude of dogmatic 
barriers by a single stroke, and to arouse a spirit of rebellion 
against all ancient authorities whose positive and unauthenticated 
statements were contradicted by modern science.” Galton made 
numerous studies on the subject of heredity. The question that 
was motivating his studies was: How is it possible for a whole 
population to remain alike in its features, as a whole, during many 
successive generations, if the average produce of each couple 
resemble their parents? He attacked the question by studying 
sweet peas, moths, hounds, and finally the records of human 
families, which he obtained by offering prizes. 

Between the years 1877 and 1889, Galton worked out a mathe- 
matical method by which he could give an exact measure of the 
relationship between, for example, height of children and the 
average heights of their parents. By statistical measurement he 
found that, if the stature of a group of parents is found to be, say 
y inches above or below the general average of the race, the aver- 
age stature of their children will be only ⅔y inches above or below 
the average of the race; and he induced the law that the mean 
heights of offspring tend to “regress back toward the mean of the 
race” in spite of the strong hereditary influence of the parents. 
This is the famous law of regression to type, although the exact 
figure ⅔ is not to be taken as final. 

¹ “Hereditary Talent and Character,” Macmillan's Magazine, Vol. 12 
(May, 1865-October, 1865), pp. 157-166, 318-327; Hereditary Genius 
(1869, 2d ed. 1892); English Men of Science (1874); Human Faculty (1883); 
Record of Family Faculties (1884); Life History Album (1884); Natural 
Inheritance (1889). Cf. Walker, op. cit., pp. 102-103. 

The method Galton used was based upon the median and 
quartiles and has not been generally followed in subsequent work. 
In the 1890's another method, based on the arithmetic mean 
and the standard deviation, was devised by Karl Pearson. His 
method has been widely adopted and is known as the “Pearsonian 
coefficient of correlation.”¹ 

It should be pointed out that in the fields of meteorology and 
astronomy mathematicians had previously worked out a formula 
for a joint or bivariate normal frequency distribution. This 
gave the probability of the simultaneous occurrence of two errors 
of observation but did not directly indicate a measure of correla- 
tion between them. Work in this field was more concerned with 
the simultaneous occurrence of independent errors than of 
correlated errors.² Galton, as already indicated, was primarily 
concerned with the problem of correlation, and it remained for 
Karl Pearson and others to combine the work of Galton and the 
work of the mathematicians into a unified theory of correlation. 
Pearson’s development of the theory of correlation will be 
explained on pages 338 to 349. 

Applications of the Method by Social Scientists. As early as 
1901, R. H. Hooker, using the Pearsonian coefficient, studied 
correlation between marriage rates and trade. He correlated 
marriage rates with per capita exports of England, with per 
capita imports, and with other trade events.³ In 1906, G. 
Udny Yule likewise made a study of correlation between mar- 
riage rates and trade. He also correlated trade activity with 
birth rates and death rates but found little correlation between 
them.⁴ 

¹ Cf. Walker, op. cit., pp. 110-115; Pearson, Karl, “Notes on the 
History of Correlation,” Biometrika, Vol. 13 (1920-1921), pp. 25-45, where 
he cites W. F. R. Weldon, “Variations Occurring in Certain Decapod Crus- 
tacea. I. Crangon vulgaris,” Proceedings of the Royal Society of London, Vol. 
47 (1890), pp. 445-453; Weldon, W. F. R., “Certain Correlated Variations 
in Crangon vulgaris,” Proceedings of the Royal Society of London, Vol. 51 
(1892), pp. 2-21; Yule, G. U., “On the Theory of Correlation,” Journal of 
the Royal Statistical Society, Vol. 60 (1897), pp. 812-850. 

² Pretorius, S. J., “Skew Bivariate Frequency Surface, Examined in the 
Light of Numerical Illustrations,” Biometrika, Vol. 22 (1930-1931), pp. 
109-223; Pearson, Karl, “The Contribution of Giovanni Plana to the 
Normal Bivariate Frequency Surface,” Biometrika, Vol. 20A (1928), pp. 
295-298; Walker, Helen M., “The Relation of Plana and Bravais to the 
Theory of Correlation,” Isis, Vol. 10, No. 34 (1938), pp. 466-484. 

³ “Correlation of the Marriage-rate with Trade,” Journal of the Royal 
Statistical Society, Vol. 64 (September, 1901), pp. 485-492. 

⁴ Yule, G. Udny, “On Changes in the Marriage- and Birth-rates in 
England and Wales, Etc.,” Journal of the Royal Statistical Society, Vol. 69 

The entire science of biometrics has been built up by the 
development of correlation methods; Karl Pearson is one of the 
founders of Biometrika, the scientific organ in that field of study. 
Correlation measurement has been intensively applied in psy- 
chological and educational research.¹ In recent years, the 
correlation method has played an important role in the analysis 
of economic problems and in economic theory, a trend particularly 
evident in the field of agricultural economics. 

THE BIVARIATE FREQUENCY DISTRIBUTION 

The statistical basis for the study of correlation is the bivariate 
or multivariate frequency distribution. In the univariate 
frequency distributions studied in the previous chapters, the 
data were classified according to a single characteristic. In 
bivariate or multivariate distributions, data are classified accord- 
ing to two or more characteristics. This chapter will be con- 
cerned with the analysis of bivariate distributions. Chapter 
XVI will deal with multivariate distributions. 

An Illustration of a Bivariate Distribution. Table 24 shows 
the distribution of grades of 81 freshmen in a second-semester 
English course at Mount Holyoke. For each of these 81 students 
there is also available the grade in first-semester English. Hence 
they may be cross-classified according to both their first- and 
second-semester grades. This has been done in Table 25. 

Table 24. Grades of 81 Mount Holyoke Freshmen in a Second- 
semester English Course* 

Grades (X1)    Frequencies (F) 
60-                  1 
80-                  0 
100-                 3 
120-                 0 
140-                 2 
160-                 9 
180-                 8 
200-                16 
220-                17 
240-                13 
260-                 9 
280-                 1 
300-                 2 
Total               81 

(1906), pp. 88-132; “The Applications of the Method of Correlation to Social 
and Economic Statistics,” Journal of the Royal Statistical Society, Vol. 72 
(1909), pp. 721-730. 

¹ Rugg, Harold O., Statistical Methods Applied to Education, 


Table 25. Bivariate Frequency Distribution of 81 Mount Holyoke 
Freshmen According to Their Grades in First- (X2) and Second- 
(X1) Semester English 

X1 \ X2    60-  80- 100- 120- 140- 160- 180- 200- 220- 240- 260- 280-     F 
60-          .    .    .    1    .    .    .    .    .    .    .    .     1 
80-          .    .    .    .    .    .    .    .    .    .    .    .     0 
100-         2    .    1    .    .    .    .    .    .    .    .    .     3 
120-         .    .    .    .    .    .    .    .    .    .    .    .     0 
140-         .    .    .    1    .    .    1    .    .    .    .    .     2 
160-         .    .    .    5    3    1    .    .    .    .    .    .     9 
180-         .    .    .    .    2    4    2    .    .    .    .    .     8 
200-         .    .    .    .    .    3    4    7    2    .    .    .    16 
220-         .    .    .    .    .    .    2    4    7    4    .    .    17 
240-         .    .    .    .    .    .    .    2    7    3    1    .    13 
260-         .    .    .    .    .    .    .    .    1    4    4    .     9 
280-         .    .    .    .    .    .    .    .    1    .    .    .     1 
300-         .    .    .    .    .    .    .    .    .    .    .    2     2 
F            2    0    1    7    5    8    9   13   18   11    5    2    81 


The bivariate frequency distribution represented by Table 25 
gives more complete information than is contained in the uni- 
variate frequency distribution of Table 24. Of the 8 students 
having second-semester grades between 180 and 200, the seventh 
row of Table 25 shows that 2 had first-semester grades between 
140 and 160, 4 had first-semester grades between 160 and 180, 
and 2 had first-semester grades between 180 and 200. This is a 
small univariate frequency distribution of the group of students 
who had grades between 180 and 200 in their second-semester 
course. In Table 25 there are 11 rows and 11 columns each of 
which contains a univariate frequency distribution. Since there 
are 11 subgroups of 11 groups, there are altogether 121 classes, 
represented in the table by 121 squares, or cells, of which 28 cells 
contain frequencies. 

The totals of the columns of Table 25 give the univariate 
frequency distribution of all the students classified according to 
their first-semester English grades. The totals of the rows give 
the univariate frequency distribution of all the students classified 
according to their second-semester English grades. 

For each of the columns an arithmetic mean could be calcu- 
lated and the question could be answered: Did girls who earned 
high grades in their first-semester English average higher grades 
in second-semester English than did the girls who attained only 
low grades in their first-semester English? An arithmetic mean 
could similarly be calculated for each of the row frequency 
distributions. For all the 11 column frequency distributions 
and all the 11 row frequency distributions the standard deviations 
also could be calculated. In other words, in this bivariate 
frequency distribution there are 22 univariate frequency dis- 
tributions in addition to the 2 univariate frequency distributions 
represented by the totals for the respective variables. Each of 
these 22 frequency distributions might be analyzed in the same 
way as any frequency distribution. 
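
The relation between the cells of the bivariate table and its two marginal distributions can be checked mechanically. The sketch below is illustrative only: the dictionary `CELLS` holds the 28 occupied cells of Table 25, keyed by class midpoints (so the class “60-” is assumed to run from 60 to 80 with midpoint 70), and the names are hypothetical.

```python
from collections import Counter

# Occupied cells of Table 25, keyed by assumed class midpoints:
# (second-semester grade X1, first-semester grade X2) -> number of students.
CELLS = {
    (70, 130): 1,
    (110, 70): 2, (110, 110): 1,
    (150, 130): 1, (150, 190): 1,
    (170, 130): 5, (170, 150): 3, (170, 170): 1,
    (190, 150): 2, (190, 170): 4, (190, 190): 2,
    (210, 170): 3, (210, 190): 4, (210, 210): 7, (210, 230): 2,
    (230, 190): 2, (230, 210): 4, (230, 230): 7, (230, 250): 4,
    (250, 210): 2, (250, 230): 7, (250, 250): 3, (250, 270): 1,
    (270, 230): 1, (270, 250): 4, (270, 270): 4,
    (290, 230): 1,
    (310, 290): 2,
}

row_totals = Counter()   # univariate distribution of X1 (row totals)
col_totals = Counter()   # univariate distribution of X2 (column totals)
for (x1, x2), freq in CELLS.items():
    row_totals[x1] += freq
    col_totals[x2] += freq

n = sum(CELLS.values())
print("students:", n, "occupied cells:", len(CELLS))
print("X1 totals:", sorted(row_totals.items()))
print("X2 totals:", sorted(col_totals.items()))
```

The row totals reproduce the frequencies of Table 24, and the counts of non-empty rows and columns correspond to the 11 row and 11 column distributions mentioned in the text.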

METHODS OF SUMMARIZATION AND COMPARISON IN 
BIVARIATE DISTRIBUTIONS 

The characteristics of a bivariate frequency distribution can 
be described by various statistics. Many of these are the same 
as the statistics employed in the description of a univariate 
frequency distribution, but some are new. Thus, the central 
tendency of one of the two variables may be measured by its 
mean, its mode, or its median. The dispersion of this variable 
may be measured by its range, standard deviation, average 
deviation, or quartile deviation; and its skewness and kurtosis 
may be measured by β1 and β2, respectively. The same is true 
for the other variable and for the numerous univariate frequency 
distributions that make up the details of a single bivariate 
distribution, as explained in the preceding paragraph. New 
statistics are required, however, to describe the tendency of the 
variables to vary in unison. A bivariate frequency distribution 
thus presents the new problem of measuring correlation and the 
discovery of statistics for measuring it. 

Progressions of Means. If the data are grouped in the form 
of a bivariate scatter diagram such as Table 25, one way to 
measure the association between the two variables is to compute 
the mean values of one variable for various values of the other 
variable. In Table 25, for example, the means of the columns 
would show how the X1 variable tends to change on the average 
with changes in X2, and the means of the rows show how the X2 
variable tends to change on the average with changes in X1. 
The values of these column and row means are given in Table 26 
and graphed in Figs. 105 and 106. 

Fig. 105. Progression of the means of X1 with changes in X2. 

The nature of the association between the variables is evident 
from these graphs. Consider, for example, the progression of 
the means of X1 shown in Table 26 and Fig. 105. These show 
that the mean value of X1 tends to increase with increases in X2. 
Thus, when X2 is between 100 and 120, the mean value of X1 is 
110; when X2 is between 200 and 220, the mean value of X1 is 
222.31; and when X2 is between 260 and 280, the mean value of X1 
is 266.0. Although the increase in the average value of X1 with a 
given increase in X2 does not appear to be uniform, the progres- 
sion of the means of X1 with a change in X2 does appear to follow 
a straight line. The same can be said of the progression of the 
means of X2 with changes in X1. 

Fig. 106. Progression of the means of X2 with changes in X1. 
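
The progressions of means can be recomputed directly from the cell frequencies. The following is a minimal sketch, assuming that every case in a class is placed at the class midpoint; the `CELLS` dictionary and the helper `conditional_means` are illustrative names, not part of the text.

```python
# Occupied cells of Table 25, keyed by assumed class midpoints:
# (second-semester grade X1, first-semester grade X2) -> number of students.
CELLS = {
    (70, 130): 1,
    (110, 70): 2, (110, 110): 1,
    (150, 130): 1, (150, 190): 1,
    (170, 130): 5, (170, 150): 3, (170, 170): 1,
    (190, 150): 2, (190, 170): 4, (190, 190): 2,
    (210, 170): 3, (210, 190): 4, (210, 210): 7, (210, 230): 2,
    (230, 190): 2, (230, 210): 4, (230, 230): 7, (230, 250): 4,
    (250, 210): 2, (250, 230): 7, (250, 250): 3, (250, 270): 1,
    (270, 230): 1, (270, 250): 4, (270, 270): 4,
    (290, 230): 1,
    (310, 290): 2,
}


def conditional_means(cells, by):
    """Mean of the other variable within each class of variable `by`.

    by=1 groups on X2 (columns) and averages X1;
    by=0 groups on X1 (rows) and averages X2.
    """
    sums, counts = {}, {}
    for key, freq in cells.items():
        group, value = key[by], key[1 - by]
        sums[group] = sums.get(group, 0) + freq * value
        counts[group] = counts.get(group, 0) + freq
    return {g: sums[g] / counts[g] for g in sorted(sums)}


means_x1_by_x2 = conditional_means(CELLS, by=1)   # column means
means_x2_by_x1 = conditional_means(CELLS, by=0)   # row means

for midpoint, mean in means_x1_by_x2.items():
    print(f"X2 class {midpoint - 10}-: mean X1 = {mean:.2f}")
```

Empty classes (for example X2 between 80 and 100) simply do not appear in the result, which corresponds to the blank entries of Table 26.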

Lines of Regression. The tendency of the progressions of 
means to follow straight lines suggests the following hypothesis: 
Consider first the progression of the means of X1 with changes in 
X2. Suppose that X1 is related to X2 in such a way that an 
increase in X2 of one unit always produces an increase in X1 of, 
say b units, b being a constant. If X2 were the only factor affect- 
ing X1, all the values of X1, when plotted, would fall exactly on a 
straight line and the progression of all means would be perfectly 
linear. If there were other forces affecting X1, however, causing 
it to be higher or lower than the value expected from its associa- 
tion with X2, then the actual values would not fall on a straight 
line but would be scattered about that line. If this view of the 
variation between X1 and X2 is adopted, a straight line fitted to 
the data should give the law of relationship between X1 and X2 
and the scatter about it should give the deviation from this line 
caused by the other factors affecting X1. 


Table 26. Means of Rows and Means of Columns 
From correlation table showing the relationship between second- and first- 
semester grades of 81 Mount Holyoke freshmen 

Left pair of columns: progression of means of second-semester English 
grade (X1) with successive values of first-semester English grade (X2); 
regression of X1 on X2 (vertical frequency distributions in Table 25). 
Right pair of columns: progression of means of first-semester English 
grade (X2) with successive values of second-semester English grade (X1); 
regression of X2 on X1 (horizontal frequency distributions in Table 25). 

Values of X2   Means of X1   Values of X1   Means of X2 
60-              110.00      60-              130.00 
80-                          80- 
100-             110.00      100-              83.33 
120-             152.86      120- 
140-             178.00      140-             160.00 
160-             195.00      160-             141.11 
180-             203.33      180-             170.00 
200-             222.31      200-             200.00 
220-             241.11      220-             225.29 
240-             250.00      240-             234.62 
260-             266.00      260-             256.67 
280-             310.00      280-             230.00 
                             300-             290.00 


A similar view could be taken of the variation in the mean 
value of X2 with changes in X1 and would justify drawing a 
straight line to show the law of relationship between X2 and X1. 
The lines that are derived to show the relationship between the 
mean value of one variable and changes in value of another are 
called “lines of regression,” following Galton, who used this term 
in his original study of the relationship between the heights of 
children and the heights of their parents. 

The Line of Regression of X1 on X2. Suppose the above hypoth- 
esis is adopted, namely, that X1 is linearly related to X2 and 
that deviations from this relationship are the result of forces 
independent of X2. The statistical problem then becomes how to 
draw the line that is supposed to show this linear relationship. 

One of the simplest ways of finding the line of regression of X1 
on X2 is to plot the progression of the means of X1 for various 
values of X2 and to draw a line freehand through the means so that 
it seems to fit the progression of means. The great difficulty with 
this method is that it involves considerable personal discretion 
and that no two persons will necessarily draw the same line. 

An impersonal method of fitting a line to a given set of data is 
the so-called “method of least squares.” This fits a line to the 
data so that the sum of the deviations of the dependent variable 
from the line is zero and the sum of the squares of the deviations is 
a minimum (hence the name “method of least squares”). 
Mathematically, the first of these two conditions follows from 
the second, so that there is really only one condition, viz., that of 
least squares. 
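
The two properties just stated can be verified numerically. The sketch below fits a line by the usual closed-form least-squares solution to a small set of hypothetical observations (the data and the function names are invented for illustration) and then checks that the deviations sum to zero and that moving the line in any direction increases the sum of squares.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_of_squares(a, b, xs, ys):
    """Sum of squared vertical deviations of the points from y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
deviations = [y - (a + b * x) for x, y in zip(xs, ys)]
best = sum_of_squares(a, b, xs, ys)

print("sum of deviations:", round(sum(deviations), 10))
print("sum of squares at the fitted line:", round(best, 4))
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    print("shifted line:", round(sum_of_squares(a + da, b + db, xs, ys), 4))
```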

The use of the method of least squares to fit lines to a set of 
data goes back to the beginning of the nineteenth century. It 
first came into prominence in 1806, when Adrien Marie Legendre 
(1752-1833) published a book on new methods of determining 
orbits of comets. After the publication of this book, Karl 
Friedrich Gauss (1777-1855), a German mathematician, claimed 
that he had been applying this principle since 1795. 

Later it was shown that, if the method is used to fit a line to a 
sample set of data, then, under particular circumstances, the line 
so determined is the best, or optimum, estimate of the population 
line. For example, if data are available as to the orbit of a comet 
or planet and if a line or curve is fitted to these data by the method 
of least squares, then the line or curve so obtained would be the 
most probable estimate of the true orbit.¹ 

The line of regression of X1 on X2 may be derived by the method 
of least squares as follows: Consider the point P, Fig. 107. This, 
according to hypothesis, would fall at P' if there were no forces 
associated with X1 other than X2. Supposedly, however, there 
are other forces that are independent of X2 and make X1 smaller 
than this average value so as to cause the point to be located 
actually at P. Since these other forces affect only X1, the point 
is deflected in a vertical direction. The line of regression of X1 on 
X2 is therefore to be obtained by minimizing the vertical devia- 
tions from the line. 

¹ See Smith and Duncan, Sampling Statistics. 

Let the equation of this line of regression of X1 on X2 be 

$$X_1' = a_{1.2} + b_{12}X_2$$

The deviations would then be 

$$d_{1.2} = X_1 - a_{1.2} - b_{12}X_2$$

and the problem is to determine a1.2 and b12 so as to minimize the 
sum of the squares of deviations like X1 - X1' shown in Fig. 107, i.e. 
(X1 - X1' = x1 - x1' because x1 = X1 - X̄1 and x1' = X1' - X̄1), 

$$\Sigma(X_1 - X_1')^2 = \Sigma(X_1 - a_{1.2} - b_{12}X_2)^2 = \text{minimum}$$

Fig. 107. Diagram illustrating the fitting of the line of regression of X1 on X2 
by the method of least squares (vertical deviations minimized). 

According to the differential calculus, the conditions for min- 
imizing Σ(X1 - a1.2 - b12X2)² are that the partial derivative 
with respect to a1.2 and the partial derivative with respect to b12 
should both be zero. These conditions are 

$$\frac{\partial\,\Sigma(X_1 - a_{1.2} - b_{12}X_2)^2}{\partial a_{1.2}} = -2\Sigma(X_1 - a_{1.2} - b_{12}X_2) = 0 = \Sigma d_{1.2} \tag{1}$$

$$\frac{\partial\,\Sigma(X_1 - a_{1.2} - b_{12}X_2)^2}{\partial b_{12}} = -2\Sigma(X_1 - a_{1.2} - b_{12}X_2)X_2 = 0 = \Sigma d_{1.2}X_2 \tag{2}$$

If the parentheses are removed, these equations may be written 

$$\Sigma a_{1.2} + b_{12}\Sigma X_2 = \Sigma X_1$$

$$a_{1.2}\Sigma X_2 + b_{12}\Sigma X_2^2 = \Sigma X_1X_2$$

(Σa1.2 = Na1.2 because a1.2 is a constant.) 

The first gives a1.2 in terms of b12 as follows: 

$$a_{1.2} = \bar{X}_1 - b_{12}\bar{X}_2 \tag{3}$$

(ΣX1/N = X̄1, and ΣX2/N = X̄2.) 

If this is substituted in the second, the value of b12 is found to be 

$$b_{12} = \frac{\Sigma X_1X_2 - N\bar{X}_1\bar{X}_2}{\Sigma X_2^2 - N\bar{X}_2^2} \tag{4}$$



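Equations (3) and (4) thus give the values of a1.2 and b12 in terms 
of the sample values of X1 and X2. If these values are grouped 
into class intervals and deviations are measured from an arbitrary 
origin, the last equation may be put in the form 

$$b_{12} = \frac{\Sigma d_1d_2 - Nc_1c_2}{\Sigma d_2^2 - Nc_2^2} \tag{5}$$

where d1 and d2 are the deviations of X1 and X2 from the arbitrary 
origin and 

$$c_1 = \frac{\Sigma d_1}{N}, \qquad c_2 = \frac{\Sigma d_2}{N}$$

If deviations are measured from the means of X1 and X2, then 

$$a_{1.2} = 0, \qquad b_{12} = \frac{\Sigma x_1x_2}{\Sigma x_2^2} \tag{6}$$
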
In the next chapter, in which the work of measuring correlation 
is illustrated by numerical calculations, it is found that for the 
bivariate frequency distribution of Table 25, 

$$b_{12} = \frac{455 - (81)\dfrac{(111)(57)}{(81)(81)}}{493 - 81\dfrac{(57)^2}{(81)^2}} = 0.8322$$

For these same data, X̄1 = 217.4 and X̄2 = 204.1 so that 
a1.2 = 217.4 - 0.832(204.1) = 47.58. The line of regression of 
X1 on X2 is thus X1' = 47.58 + 0.8322X2. 

Fig. 108. Diagram illustrating the fitting of the line of regression of X2 on X1 
by the method of least squares (horizontal deviations minimized). 
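
The regression constants can be recomputed from the cell frequencies of Table 25 with Eqs. (3) and (4). This is a sketch under the assumption that each case sits at its class midpoint; `CELLS` and `regression` are illustrative names. The same function, with the roles of the variables exchanged, gives the slope of the regression of X2 on X1.

```python
# Occupied cells of Table 25, keyed by assumed class midpoints:
# (second-semester grade X1, first-semester grade X2) -> number of students.
CELLS = {
    (70, 130): 1,
    (110, 70): 2, (110, 110): 1,
    (150, 130): 1, (150, 190): 1,
    (170, 130): 5, (170, 150): 3, (170, 170): 1,
    (190, 150): 2, (190, 170): 4, (190, 190): 2,
    (210, 170): 3, (210, 190): 4, (210, 210): 7, (210, 230): 2,
    (230, 190): 2, (230, 210): 4, (230, 230): 7, (230, 250): 4,
    (250, 210): 2, (250, 230): 7, (250, 250): 3, (250, 270): 1,
    (270, 230): 1, (270, 250): 4, (270, 270): 4,
    (290, 230): 1,
    (310, 290): 2,
}


def regression(cells, dep, ind):
    """Intercept and slope of the regression of variable `dep` on `ind`.

    `dep` and `ind` are positions in the cell key: 0 for X1, 1 for X2.
    """
    n = sum(cells.values())
    mean_dep = sum(f * k[dep] for k, f in cells.items()) / n
    mean_ind = sum(f * k[ind] for k, f in cells.items()) / n
    sum_cross = sum(f * k[dep] * k[ind] for k, f in cells.items())
    sum_ind_sq = sum(f * k[ind] ** 2 for k, f in cells.items())
    b = (sum_cross - n * mean_dep * mean_ind) / (sum_ind_sq - n * mean_ind ** 2)  # Eq. (4)
    a = mean_dep - b * mean_ind                                                   # Eq. (3)
    return a, b


a12, b12 = regression(CELLS, dep=0, ind=1)   # X1 on X2
a21, b21 = regression(CELLS, dep=1, ind=0)   # X2 on X1

print(f"X1' = {a12:.2f} + {b12:.4f} X2")
print(f"X2' = {a21:.3f} + {b21:.4f} X1")
```

Because the slope does not depend on the origin, working with the raw midpoints gives the same b12 as the short method with deviations from an arbitrary origin.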

Line of Regression of X2 on X1. The line of regression of X2 on 
X1 may also be obtained by either freehand or mathematical 
methods. A freehand line could be obtained by drawing a line 
through the progression of the means of X2 on X1. A mathe- 
matical line could be obtained by the method of least squares. 

The preceding section determined a mathematical formula for 
the line of regression of Xi on X2 by minimizing the sum of the 
squares of the vertical deviations. Now X2 is assumed to be the 
dependent variable, and the line of regression of X2 on Xi is 
determined by minimizing the sum of the squares of the horizontal 
deviations (see Fig. 108 ). Except for this difference, the process 
is precisely the same as that described for fitting the other line 
and will not be repeated here. If the line of regression of X2 on 
Xi is represented by the equation 

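$$X_2' = a_{2.1} + b_{21}X_1$$

then minimizing Σ(X2 - X2')² = Σ(X2 - a2.1 - b21X1)² gives the 
following values for a2.1 and b21: 

$$a_{2.1} = \bar{X}_2 - b_{21}\bar{X}_1 \tag{7}$$

$$b_{21} = \frac{\Sigma X_1X_2 - N\bar{X}_1\bar{X}_2}{\Sigma X_1^2 - N\bar{X}_1^2} \tag{8}$$

or, with deviations d1 and d2 measured from an arbitrary origin and 
c1 = Σd1/N, c2 = Σd2/N, 

$$b_{21} = \frac{\Sigma d_1d_2 - Nc_1c_2}{\Sigma d_1^2 - Nc_1^2} \tag{9}$$

If deviations are measured from the means of X1 and X2, then 

$$a_{2.1} = 0, \qquad b_{21} = \frac{\Sigma x_1x_2}{\Sigma x_1^2} \tag{10}$$
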

For the data of Table 25 the line of regression of X2 on X1 is thus 
found to be 

$$X_2' = -5.548 + 0.9642X_1$$

This is shown in Fig. 108. 

Interpretation of a Line of Regression. A line of regression 
of one variable on another is to be interpreted as indicating the 
values of the first (the dependent variable) that would be 
obtained for various values of the second (the independent vari- 
able) if no other forces were affecting the dependent variable. 
If knowledge of the independent variable is all that is to be had, 
then the line of regression gives the best estimate that may be 
made of the dependent variable. 

The regression statistic a (that is, a1.2 or a2.1) gives the value 
of the dependent variable when the independent variable is zero. 
It is of only arbitrary significance, since its value is affected by 
the origin selected for measuring the independent variable as 
well as the units of measurement. The regression statistic b 
(that is, b12 or b21) is independent of the origin selected and indi- 
cates the change that would occur in the dependent variable per 
unit change in the independent variable. In the line of regression 
of X1 on X2, for example, when X2 increases by one unit, X1 
increases or decreases by b12 units depending on the sign of b12. 
The value of b12 will not be affected by proportional changes in 
the units of X1 and X2. Similar statements hold for b21 in the 
case of the line of regression of X2 on X1. 
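
The behavior of a and b under a change of origin or of units can be illustrated with a short computation. The data below are hypothetical and the function name is illustrative; the sketch only demonstrates the statements made above.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Hypothetical paired grades, for illustration only.
x2 = [130, 150, 170, 190, 210, 230]   # independent variable
x1 = [150, 180, 190, 200, 225, 240]   # dependent variable

a, b = fit_line(x2, x1)

# New origin for the independent variable: measure X2 from 100 instead of 0.
a_shift, b_shift = fit_line([x - 100 for x in x2], x1)

# Proportional change of units: both variables expressed in units of 20 points.
a_scale, b_scale = fit_line([x / 20 for x in x2], [y / 20 for y in x1])

print(f"original:        a = {a:.3f}, b = {b:.4f}")
print(f"origin shifted:  a = {a_shift:.3f}, b = {b_shift:.4f}")
print(f"units rescaled:  a = {a_scale:.3f}, b = {b_scale:.4f}")
```

Shifting the origin changes only a; rescaling both variables by the same factor changes a but leaves b as it was.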

Standard Deviation about Means or Line of Regression. If the 
progressions of the means or the lines of regression are used to 
measure the average relationship between two variables, some 
additional measure is desirable to determine the degree of repre- 
sentativeness of these measures. In the case of a monovariate 
distribution, it will be recalled, the representativeness of the 
mean depended upon how closely the cases were scattered around 
this mean value. This dispersion was measured by the standard 
deviation or some other measure. Similarly, in the present 
instance, the representativeness of the means of Xi, say, for 
various values of X2 will be shown by the dispersion of the cases 
around each mean. The standard deviation of the cases in each 
column around the mean of that column may thus be taken to 
show how well the mean represents the cases in the column. 
The same is true for any row. 

In Table 27 are given the standard deviations of the columns 
of Table 25. The zero values refer to the columns in which 
there is only one case. The other values center around 16, their 
average being 16.9. 
It is to be noted that the average standard deviation from the 
means of the columns, as well as the individual standard devia- 
tions from which this average is calculated, are considerably 
less than the total standard deviation of X1, namely, σ1 = 43.9. 
The column means are thus much more representative of the 
column values of X1 than the grand mean is of all the X1's. 


Table 27. Standard Deviations for Columns of Table 25 

Column    Nc     ΣFxc²       ΣFxc²/Nc    √(ΣFxc²/Nc) 
(1)        2        0            0           0 
(2)        0 
(3)        1        0            0           0 
(4)        7     8,342.86     1,191.8       34.5 
(5)        5       480.00        96.0        9.8 
(6)        8     1,400.00       175.0       13.2 
(7)        9     4,800.00       533.3       23.1 
(8)       13     2,830.77       217.8       14.8 
(9)       18     6,577.78       365.4       19.1 
(10)      11     3,200.00       290.9       17.1 
(11)       5       320.00        64.0        8.0 
(12)       2        0            0           0 
Total     81 


This may be explained by the fact that much of the total varia- 
tion in Xi is due to the variation from column to column, a 
variation that is presumably due to association with X2. When 
this variation is eliminated, the remaining variation is consider- 
ably reduced. A similar analysis would show the same results 
with respect to variation around the means of the rows. 
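
The column standard deviations and their comparison with the total standard deviation can be recomputed from the cell frequencies. The sketch below assumes class midpoints for all cases, and it assumes that the average quoted in the text is weighted by the number of cases in each column (taken over all 81 cases); with that weighting the figure 16.9 is reproduced. All names are illustrative.

```python
import math

# Occupied cells of Table 25, keyed by assumed class midpoints:
# (second-semester grade X1, first-semester grade X2) -> number of students.
CELLS = {
    (70, 130): 1,
    (110, 70): 2, (110, 110): 1,
    (150, 130): 1, (150, 190): 1,
    (170, 130): 5, (170, 150): 3, (170, 170): 1,
    (190, 150): 2, (190, 170): 4, (190, 190): 2,
    (210, 170): 3, (210, 190): 4, (210, 210): 7, (210, 230): 2,
    (230, 190): 2, (230, 210): 4, (230, 230): 7, (230, 250): 4,
    (250, 210): 2, (250, 230): 7, (250, 250): 3, (250, 270): 1,
    (270, 230): 1, (270, 250): 4, (270, 270): 4,
    (290, 230): 1,
    (310, 290): 2,
}


def column_stats(cells):
    """For each X2 class: (cases, sum of squared deviations of X1, std dev)."""
    columns = {}
    for (x1, x2), freq in cells.items():
        columns.setdefault(x2, []).append((x1, freq))
    stats = {}
    for x2, pairs in sorted(columns.items()):
        n_c = sum(f for _, f in pairs)
        mean_c = sum(f * x1 for x1, f in pairs) / n_c
        ss = sum(f * (x1 - mean_c) ** 2 for x1, f in pairs)
        stats[x2] = (n_c, ss, math.sqrt(ss / n_c))
    return stats


stats = column_stats(CELLS)
n = sum(CELLS.values())

# Average column standard deviation, weighted by column size (an assumption).
average_sd = sum(n_c * sd for n_c, _, sd in stats.values()) / n

# Total (zero-order) standard deviation of X1.
mean_x1 = sum(f * x1 for (x1, _), f in CELLS.items()) / n
total_sd = math.sqrt(sum(f * (x1 - mean_x1) ** 2 for (x1, _), f in CELLS.items()) / n)

for x2, (n_c, ss, sd) in stats.items():
    print(f"X2 class {x2 - 10}-: N = {n_c:2d}, sum of squares = {ss:9.2f}, s.d. = {sd:5.1f}")
print(f"weighted average s.d. = {average_sd:.1f}, total s.d. of X1 = {total_sd:.1f}")
```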

If the association between Xi and X 2 is measured by a straight 
line, the representativeness of this line may be measured by the 
dispersion of cases around it. Such a measure would be the 
standard deviation of the deviations from the line. The stand- 
ard deviation of the vertical deviations from the line of regression 
of -3 li on X 2 will measure the representativeness of that line, and 
the standard deviation of the horizontal deviations from the line 
of regression of A '2 on Xi will measure its representativeness. 
In either case, equals the sum of the squared deviations from 
the line divided by N. If the line is fitted by the method of 
least squares, the sum of the squared deviations from the line 
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may be. computed from the equations ‘ 

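$$N\sigma_{1.2}^2 = \Sigma d_{1.2}^2 = \Sigma X_1^2 - a_{1.2}\Sigma X_1 - b_{12}\Sigma X_1X_2$$

and

$$N\sigma_{2.1}^2 = \Sigma d_{2.1}^2 = \Sigma X_2^2 - a_{2.1}\Sigma X_2 - b_{21}\Sigma X_2X_1$$

or

$$\sigma_{1.2}^2 = \frac{\Sigma d_{1.2}^2}{N} \qquad (11)$$

and

$$\sigma_{2.1}^2 = \frac{\Sigma d_{2.1}^2}{N} \qquad (12)$$
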


These standard deviations from the lines of regression will always 
be less than the total standard deviation, because the variation 
represented by the line of regression has been eliminated by 
taking deviations from the line. 

The average standard deviations around the means of columns or rows and the standard deviations around the lines of regression may be called “first-order standard deviations,” in contrast to the total standard deviations, which may be called “zero-order standard deviations.” Sometimes the first-order standard deviations are called “standard errors of estimate” since they indicate the error involved in using a column or row mean or a line of regression as an estimate of the dependent variable.

If the association between $X_1$ and $X_2$, say, is assumed to be measured by the means of $X_1$ for given values of $X_2$ or by the line of regression of $X_1$ on $X_2$, then the smallness of the first-order standard deviations relative to the zero-order standard deviations will give some measure of the degree of representativeness of these measures of association. As will be seen in the next section, this measure of the degree of representativeness of a line of regression is closely related to the so-called “Pearsonian coefficient of correlation.” As a measure of the degree of representativeness of a progression of means, it is closely related to the “correlation ratio,” which is discussed in Chap. XV, Nonlinear Correlation.

* The proof of this is as follows:

$$\Sigma d_{1.2}^2 = \Sigma d_{1.2}(X_1 - a_{1.2} - b_{12}X_2) = \Sigma d_{1.2}X_1 - a_{1.2}\Sigma d_{1.2} - b_{12}\Sigma d_{1.2}X_2$$

But by the least-squares equations (1) and (2), $\Sigma d_{1.2} = 0$ and $\Sigma d_{1.2}X_2 = 0$. Hence

$$\Sigma d_{1.2}^2 = \Sigma d_{1.2}X_1 = \Sigma X_1^2 - a_{1.2}\Sigma X_1 - b_{12}\Sigma X_1X_2$$

The Pearsonian Coefficient of Correlation. The progressions of means and lines of regression described above were concerned with describing the “law of relationship” between the two variables. They gave the average value of one variable associated with given values of the other variable and showed how these average values tended to change in unison with the other variable. In this section, a measure of the degree of association between the two variables will be described. This measure is known as the Pearsonian coefficient of correlation after the man who devised it.


Fig. 109. A bivariate scatter diagram showing the joint variation in imports into and exports from the United States. [United States Department of Commerce, Monthly Summary of Foreign and Domestic Commerce of the United States, Vol. 20 (March, 1940), p. 37; Survey of Current Business, Vol. 21 (March, 1941), p. 37; Vol. 22 (March, 1942), pp. 5-19.] (Scatter diagram; axis label: Exports.)


The coefficient of correlation suggested by Karl Pearson in 1890 is

$$r_{12} = \frac{\Sigma x_1x_2}{N\sigma_1\sigma_2} \qquad (13)$$

In this equation, $x_1$ and $x_2$ refer to deviations from the mean and $N$ to the number of pairs of cases. For the sake of simplicity, this coefficient will now be explained by reference to a bivariate distribution in which the cases are not grouped into class intervals.

Arithmetic View of r. Table 28 and Fig. 109 show the joint variation of two variables. They indicate that the large values of $X_1$, representing total exports from the United States, 1932-1941, are associated, for the most part, with the large values of $X_2$, which represent total imports for consumption into the United States during the same period of years.

The average of $X_1$, designated $\bar{X}_1$, is found to be 2.89; and the average of $X_2$, designated $\bar{X}_2$, is 2.19. The deviations of each


Table 28. Exports and Imports of Merchandise, United States, 1932-1941

(In billions of dollars)

          Total      Total      Deviations from       Product deviations
          exports    imports    respective means            x1 x2
Year        X1         X2         x1        x2            +          -
            (1)        (2)        (3)       (4)              (5)
1932        1.6        1.3      -1.29     -0.89        1.1481
1933        1.7        1.5      -1.19     -0.69        0.8211
1934        2.1        1.6      -0.79     -0.59        0.4661
1935        2.3        2.0      -0.59     -0.19        0.1121
1936        2.5        2.4      -0.39      0.21                  -0.0819
1937        3.3        3.0       0.41      0.81        0.3321
1938        3.1        1.9       0.21     -0.29                  -0.0609
1939        3.2        2.3       0.31      0.11        0.0341
1940        4.0        2.6       1.11      0.41        0.4551
1941        5.1        3.3       2.21      1.11        2.4531
Σ =        28.9       21.9                             5.8218

Mean of X1 = 2.89     Mean of X2 = 2.19     Σx1x2 = 5.6790


variable from its respective mean are calculated and entered in the third and fourth columns of the table. The products of $x_1$ and $x_2$, the product deviations, are calculated, and the results entered in the appropriate division of the last column. The sum of column (5), that is, $\Sigma x_1x_2$ (the sum of the product deviations), is 5.679.
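
The arithmetic of Table 28 can be retraced with a short script. This is only an illustrative sketch: the function name `product_deviations` and the variable names are assumptions introduced here, not part of the text, and the data are the ten pairs of Table 28.

```python
# Exports (X1) and imports (X2) in billions of dollars, 1932-1941 (Table 28).
exports = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
imports = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]


def product_deviations(xs, ys):
    """Return the products of paired deviations from the two means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]


products = product_deviations(exports, imports)

# Split the products the way column (5) of Table 28 does: plus and minus.
positive_sum = sum(p for p in products if p > 0)
negative_sum = sum(p for p in products if p < 0)
net_sum = positive_sum + negative_sum

print(f"positive products: {positive_sum:.4f}")
print(f"negative products: {negative_sum:.4f}")
print(f"net sum:           {net_sum:.4f}")
```

Only two of the ten products (those for 1936 and 1938) are negative, which is why the net sum stays close to the sum of the positive products.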

In Fig. 109, an $X_1$ and an $X_2$ scale are set up in such a way as to accommodate the range of these variables as shown in columns (1) and (2) of Table 28. Lines perpendicular to the respective scales at the points $\bar{X}_1 = 2.89$ and $\bar{X}_2 = 2.19$ are drawn so that the figure is divided into four quadrants: quadrant I containing values of $X_1$ and $X_2$ that are both higher than average (hence both $x_1$ and $x_2$ are positive); quadrant II containing values of $X_1$ that are smaller than average and values of $X_2$ that are larger than average (hence $x_1$ is negative and $x_2$ is positive); quadrant III containing values of $X_1$ and $X_2$ that are both smaller than average (hence both $x_1$ and $x_2$ are negative); and quadrant IV containing values of $X_1$ that are larger than average and values of $X_2$ that are smaller than average (hence $x_1$ is positive and $x_2$ is negative). The origin of the coordinates $x_1$, $x_2$ is at the intersection of the perpendicular lines at the $\bar{X}_1$ and $\bar{X}_2$ of the scales. For example, measured from the original origin, the point P has coordinates $X_1 = 3.3$, $X_2 = 3.0$; but measured from the intersection of the means the coordinates of point P are $x_1 = 0.41$, $x_2 = 0.81$ [see columns (1), (2), (3), and (4) for 1937, Table 28]. It should be noted that only one point is plotted in the second quadrant; this is the 1936 pair of variables from Table 28. The 1938 pair of variables from Table 28 appears as the sole point in the fourth quadrant. These two pairs of variables, 1936 and 1938, are the only ones in the set that have negative product deviations. The rest of the pairs of observations appear either in the first or third quadrant because their product deviations are positive quantities.

If the fluctuations of two variables are so associated that their plottings appear predominantly in quadrants I and III, the $\Sigma x_1x_2$ will be positive. This will be so when larger than average values of $X_1$ are associated with larger than average values of $X_2$ (quadrant I) and smaller than average values of $X_1$ are associated with smaller than average values of $X_2$ (quadrant III). Also, if the two variables are so associated that their plottings appear predominantly in quadrants II and IV, the sum of the product deviations will be negative. This will be so when smaller than average values of $X_1$ are associated with larger than average values of $X_2$ (quadrant II) and when larger than average values of $X_1$ are associated with smaller than average values of $X_2$ (quadrant IV). Furthermore, if the plottings are equally distributed throughout the four quadrants, the sum of the product deviations will approach zero because of the canceling of plus and minus product deviations. This will be so when there is no tendency for association of the variables in any manner, that is, when smaller than average values of $X_1$ are associated about as often with larger values of $X_2$ as with smaller values of $X_2$, etc.

A similar procedure is followed in Table 29 and Fig. 110, in which $X_3$ is the price of United States government bonds and $X_4$ is the yield on such bonds. Casual inspection of the data reveals that, when the price of bonds is high, yield is low, and vice versa.

Table 29. Prices and Yields on United States Government Bonds, 1932-1941

Averages on bonds outstanding due or callable after 12 years

          Average price   Average yield,   Deviations from       Product deviations
          ($100 par)      per cent         respective means            x3 x4
Year         X3              X4              x3        x4            +          -
             (1)             (2)             (3)       (4)              (5)
1932         98.8            3.68          -5.62      0.949                  -5.333
1933        102.3            3.31          -2.12      0.579                  -1.227
1934        104.6            3.12           0.18      0.389       0.070
1935        105.5            2.79           1.08      0.059       0.064
1936        103.7            2.65          -0.72     -0.081       0.583
1937        101.7            2.68          -2.72     -0.051       0.139
1938        103.4            2.56          -1.02     -0.171       0.174
1939                         2.36           1.58     -0.371
1940        107.2            2.21           2.78     -0.521                  -1.448
1941        111.0            1.95           6.58     -0.781                  -5.139
Σ =                         27.31                     0           1.030     -13.733
or net                                                                      -12.703



The sum of the product deviations in Table 29 is a negative 
amount, namely, —12.703. Comparison of Figs. 109 and 110 
will at once bring out the contrast in the location of pairs of 
plotted points. Whereas in Fig. 109 the points are mainly in 
quadrants I and III, the points in Fig. 110 appear principally in 
quadrants II and IV. 

Again, the same procedure is followed in Table 30 and Fig. 111, in which $X_5$ is the height of Princeton freshmen and $X_6$ is the grade of these freshmen in their examination in economics.

In Table 30 the negative and positive product deviations so nearly offset each other that the sum of product deviations is only 1.33. The tendency for the scatter of points throughout all four quadrants is depicted in Fig. 111 on page 345.

Fig. 110. A bivariate scatter diagram showing the joint variation in the price and yield of United States government bonds. [Federal Reserve Bulletin, December, 1938, p. 1045; July, 1940, pp. 701-702; and Survey of Current Business, Vol. 21 (March, 1941), p. 36; Vol. 22 (March, 1942), p. 18.] (Scatter diagram; axis label: Average price.)

These three arithmetic illustrations appear to show that the sum of the product deviations from the arithmetic means of variables, $\Sigma x_1x_2$, can be used to measure the extent to which the variables are associated or related. Following are the reasons for this:

1. When smaller than average values of $X_1$ are associated with smaller than average values of $X_2$, the $x_1x_2$ products, being products of $-x_1$ and $-x_2$, are positive, as shown in Tables 28 to 30.

2. When larger than average values of $X_1$ are associated with larger than average values of $X_2$, the $x_1x_2$ products, being products of $+x_1$ and $+x_2$, are also positive, as shown in Tables 28 to 30.

3. On the other hand, when smaller than average values of $X_1$ are associated with larger than average values of $X_2$, the $x_1x_2$ products, being products of $-x_1$ and $+x_2$, are negative, as shown for 1932 and 1933 in Table 29.

4. When larger than average values of $X_1$ are associated with smaller than average values of $X_2$, the $x_1x_2$ products, being products of $+x_1$ and $-x_2$, are also negative, as shown for 1939, 1940, and 1941 in Table 29.


Table 30. Heights of Freshmen, Princeton Class of 1941, and Grades on Examination in Economics

Heights of     Grades of freshmen,    Deviations from       Product deviations
freshmen, in.  percentage of 100      respective means            x5 x6
    X5               X6                 x5        x6            +          -
    (1)              (2)                (3)       (4)              (5)
   66.00             70               -3.96     +3.8                     15.048
   69.00             67               -0.96     +0.8                      0.768
   70.50             66               +0.54     -0.2                      0.108
   69.50             85               -0.46    +18.8                      8.648
   68.00             55               -1.96    -11.2        21.952
   70.50             60               +0.54     -6.2                      3.348
   70.50             67               +0.54     +0.8         0.432
   71.60             81               +1.64    +14.8        24.272
   73.25             66               +3.29     -0.2                      0.658
   70.75             45               +0.79    -21.2                     16.748
Σ = 699.60          662                                    +46.656      -45.326
Mean = 69.96     Mean = 66.2
or net                                                                   +1.330



5. When no consistent association prevails between the pairs of variables observed, the $+x_1x_2$ products will balance or very nearly balance the $-x_1x_2$ products, as shown in Table 30.

The sum of the products of the deviations from the means 
indicates correspondence or lack of correspondence of variations 
in two sets of variables; but the simple sum of products cannot 
be taken as the measure of correlation between the two variables 
for the following reasons: 

1. The sum of product deviations for one set of paired variables is not comparable with a similar sum of product deviations for another set of paired variables. A small sum of product deviations may result from the fact that a small number of cases is included, and a large sum of product deviations may indicate merely that a large number of cases is involved; and yet the actual degree of correlation might be the same in the two sets. In the second instance the larger sum of product deviations is due solely to the fact that it resulted from a larger number of cases. It seems obvious that an average of the product deviations is required. Such an average can be obtained by dividing the sum of product deviations by $N$. Thus the average product deviation is $\Sigma x_1x_2/N$.

2. The product deviations in terms of original units of the data are without meaning because of nonhomogeneity of units. Suppose the $X_1$ variable is the price of wheat per bushel, which would be expressed in dollars and cents; and the $X_2$ variable is the birth rate. Or, again, suppose the $X_1$ variable is the height of men expressed in inches and the $X_2$ variable is the weight of men expressed in pounds. Or suppose the $X_1$ variable is the marriage rate and the $X_2$ variable is the volume of trade, or prices, etc. In all such pairs of variables, the product deviations in terms of original units are meaningless; they are products of nonhomogeneous things. What meaning can be ascribed to the product of inches and pounds or to the product of marriage rates and volume of trade? It is necessary to perceive in the situation a general common denominator.

Fig. 111. A bivariate scatter diagram showing the joint variation (or lack of it) between the heights of Princeton freshmen and their grades on an examination in economics. (Scatter diagram; axis labels: Freshmen heights; Per cent grades.)

The comparable thing being compared is the purely abstract thing, deviation above or below average; accordingly, the standard deviation $\sigma$ may be used as a general common denominator. Whatever the original unit of measurement, if the variable is normally distributed the standard deviation represents approximately one-sixth of the range of that variable. The standard deviation is a unit of deviation from the mean measuring a common characteristic among all variables and is, therefore, a homogeneous unit among all variables. Consequently, the standard deviation is used to reduce these product deviations to terms of comparability with each other. When this is done, the average product deviation becomes a measure of correlation known as the Pearsonian coefficient, namely,

$$r_{12} = \frac{1}{N}\,\Sigma\left(\frac{x_1}{\sigma_1}\right)\left(\frac{x_2}{\sigma_2}\right)$$

Since $\sigma_1$ and $\sigma_2$ are constants in each particular problem, this equation may be written as follows:

$$r_{12} = \frac{\Sigma x_1x_2}{N\sigma_1\sigma_2}$$


This is the usual form in which the formula for the Pearsonian coefficient of correlation is given. The value of this average expression fluctuates between the limits +1 and -1. Any value greater than +1 or less than -1 is a mistake, not an error in the statistical sense. If $r = +1$, this means perfect positive correlation (large values of $X_1$ are associated with large values of $X_2$, and vice versa); if $r = -1$, this means perfect negative correlation (large values of $X_1$ are associated with small values of $X_2$, and vice versa); if $r_{12} = 0$, this means no linear correlation.

Calculation of the Coefficient of Correlation. The data in Table 28 may be taken to illustrate the detailed calculation of the coefficient of correlation, by first expressing all product deviations in terms of the respective standard-deviation units.

Table 31. Calculation of Coefficient of Correlation between United States Exports and Imports, 1932-1941

Deviations from       Squares of deviations     Deviations from respective     Product deviations in
respective means,     from respective means     means in standard-deviation    standard-deviation
billions of dollars                             units                          units
   x1       x2          x1^2       x2^2          x1/σ1      x2/σ2                 +         -
   (1)      (2)         (3)        (4)           (5)        (6)                     (7)
 -1.29    -0.89       1.6641     0.7921        -1.251     -1.435               1.795
 -1.19    -0.69       1.4161     0.4761        -1.154     -1.112               1.283
 -0.79    -0.59       0.6241     0.3481        -0.766     -0.951               0.728
 -0.59    -0.19       0.3481     0.0361        -0.572     -0.306               0.175
 -0.39     0.21       0.1521     0.0441        -0.378      0.338                        -0.128
  0.41     0.81       0.1681     0.6561         0.398      1.306               0.520
  0.21    -0.29       0.0441     0.0841         0.204     -0.467                        -0.095
  0.31     0.11       0.0961     0.0121         0.301      0.177               0.053
  1.11     0.41       1.2321     0.1681         1.077      0.661               0.712
  2.21     1.11       4.8841     1.2321         2.144      1.789               3.836
Σ =                  10.6290     3.8490                                        9.102    -0.223
σ1 = 1.031     σ2 = 0.6204                                          or net Σ(x1/σ1)(x2/σ2) = 8.879

The standard deviations were calculated from the sums of columns (3) and (4).


The Pearsonian coefficient of correlation may now be quickly calculated from the sum of product deviations in standard-deviation units [the foot of column (7) of Table 31]. This sum divided by $N$ is the coefficient of correlation. In other words,

$$r_{12} = \frac{\Sigma\left(\dfrac{x_1}{\sigma_1}\right)\left(\dfrac{x_2}{\sigma_2}\right)}{N} = \frac{8.879}{10} = 0.8879$$
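
The procedure of Table 31 can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: the helper name `standardize` is introduced here for convenience, and the standard deviation divides by $N$ (not $N - 1$), as the text does throughout.

```python
from math import sqrt

# Exports (X1) and imports (X2) in billions of dollars, 1932-1941 (Table 28).
exports = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
imports = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]


def standardize(values):
    """Express each value as a deviation from the mean in sigma units."""
    n = len(values)
    mean = sum(values) / n
    sigma = sqrt(sum((v - mean) ** 2 for v in values) / n)  # divide by N
    return [(v - mean) / sigma for v in values]


z1 = standardize(exports)   # column (5) of Table 31
z2 = standardize(imports)   # column (6) of Table 31

# Column (7): product deviations in standard-deviation units.
standardized_products = [a * b for a, b in zip(z1, z2)]

# r is the average of the standardized products.
r12 = sum(standardized_products) / len(standardized_products)
print(f"r12 = {r12:.4f}")
```

Because the script carries full precision instead of three decimals per cell, individual entries may differ from Table 31 in the last digit, but the coefficient rounds to the same 0.8879.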
It is not necessary, however, to divide each deviation by its standard deviation because the two standard deviations are constants. Table 28 having been constructed, if the standard deviations are calculated, as in columns (3) and (4), Table 31, it is then necessary only to use Eq. (13), as follows:

$$r_{12} = \frac{\Sigma x_1x_2}{N\sigma_1\sigma_2} = \frac{5.6790}{10(1.031)(0.6204)} = \frac{0.5679}{0.6396} = 0.8879$$

Accordingly columns (5) to (7) of Table 31 need not be computed. For example, to calculate the coefficients of correlation for the data in Tables 29 and 30, the standard deviations are calculated and the respective coefficients of correlation are then obtained as follows:

Correlation between prices and yields on United States government bonds, 1932-1941 ($\sigma_3 = 3.16$, $\sigma_4 = 0.51$):

$$r_{34} = \frac{\Sigma x_3x_4}{N\sigma_3\sigma_4} = \frac{-12.703}{10(3.16)(0.51)} = \frac{-1.2703}{1.6116} = -0.7882$$

Correlation between heights and grades of freshmen ($\sigma_5 = 1.889$, $\sigma_6 = 10.96$):

$$r_{56} = \frac{\Sigma x_5x_6}{N\sigma_5\sigma_6} = \frac{1.33}{10(1.89)(10.96)} = \frac{0.133}{(1.89)(10.96)} = \frac{0.133}{20.7} = 0.0064$$
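
The same short-cut, dividing the sum of product deviations by $N\sigma\sigma$, can be written as a small general-purpose function. The sketch below applies it to the heights and grades of Table 30; the function name `pearson_r` is an assumption made for illustration, and the standard deviations again divide by $N$.

```python
from math import sqrt

# Heights (inches) and economics grades of ten Princeton freshmen (Table 30).
heights = [66.00, 69.00, 70.50, 69.50, 68.00, 70.50, 70.50, 71.60, 73.25, 70.75]
grades = [70, 67, 66, 85, 55, 60, 67, 81, 66, 45]


def pearson_r(xs, ys):
    """Pearsonian r = sum(x*y) / (N * sigma_x * sigma_y), x and y being deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    sum_products = sum(a * b for a, b in zip(dev_x, dev_y))
    sigma_x = sqrt(sum(a * a for a in dev_x) / n)
    sigma_y = sqrt(sum(b * b for b in dev_y) / n)
    return sum_products / (n * sigma_x * sigma_y)


r_heights_grades = pearson_r(heights, grades)
print(f"r = {r_heights_grades:.4f}")
```

A coefficient this close to zero reflects the near-cancellation of positive and negative product deviations seen in Table 30.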

For a small number of cases it is possible to calculate a coefficient of correlation according to the procedure illustrated in the tables and calculations immediately preceding. For a large number of pairs of values it is desirable to group the pairs into class intervals. The value of $X_1$ for each pair then becomes the mid-point of the interval to which the $X_1$ value belongs; the value of $X_2$ for the pair will be the mid-point of the interval to which the $X_2$ value belongs. If more than one pair of cases belongs to the same $X_1$ and $X_2$ intervals, the frequency of such pairs is determined. This procedure was illustrated by the analysis of Tables 24 and 25 in discussing the bivariate frequency distribution of 81 Mount Holyoke freshmen.* When the bivariates are arranged in a bivariate frequency distribution, $r_{12}$ is measured by $\Sigma Fx_1x_2/N\sigma_1\sigma_2$, where $F$ represents the frequency of pairs of values belonging to the same $X_1$ and $X_2$ intervals. For example, in Table 26 (for $X_1$ = 160-, $X_2$ = 120-), $F = 5$, $x_1 = -47.4$, and $x_2 = -74.1$. Accordingly, this $Fx_1x_2$ (for $X_1$ = 160-, $X_2$ = 120-) is equal to $5(-47.4)(-74.1) = 17,561.7$. When this procedure is followed for the entire table, the $\Sigma Fx_1x_2$ is obtained. Special methods for calculating $r$ from grouped data are described in detail in Chap. XIV, in which advantage is taken of certain short-cut procedures.

Relationship between Lines of Regression and r. If a line of regression is fitted by the method of least squares, the values of $b_{12}$ and $b_{21}$ are given by Eqs. (6) and (10). It will now be shown that these reduce to formulas involving $r_{12}$. From the definition of $r_{12} = \Sigma x_1x_2/N\sigma_1\sigma_2$,

$$\Sigma x_1x_2 = N\sigma_1\sigma_2r_{12}$$

Secondly, note that, from the definition of $\sigma_2^2 = \Sigma x_2^2/N$,

$$\Sigma x_2^2 = N\sigma_2^2$$

Hence,

$$b_{12} = \frac{\Sigma x_1x_2}{\Sigma x_2^2} = \frac{N\sigma_1\sigma_2r_{12}}{N\sigma_2^2} = r_{12}\frac{\sigma_1}{\sigma_2}$$

In the same manner it can be shown that

$$b_{21} = r_{12}\frac{\sigma_2}{\sigma_1}$$

Hence, if deviations of the variables are measured from their mean values, the lines of regression may be written (in this case the $a_{1.2}$ and $a_{2.1}$ are both zero)

$$x_1' = r_{12}\frac{\sigma_1}{\sigma_2}x_2 \qquad (14)$$

$$x_2' = r_{12}\frac{\sigma_2}{\sigma_1}x_1 \qquad (15)$$

If the first of these equations is divided by $\sigma_1$ and the second by $\sigma_2$, they become

$$\frac{x_1'}{\sigma_1} = r_{12}\frac{x_2}{\sigma_2}$$

$$\frac{x_2'}{\sigma_2} = r_{12}\frac{x_1}{\sigma_1}$$

* See pp. 325-326.
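
As a numerical check on these relations, the slopes obtained directly by least squares can be compared with $r_{12}\sigma_1/\sigma_2$ and $r_{12}\sigma_2/\sigma_1$ for the exports and imports of Table 28. This is a sketch only: it assumes the usual least-squares slope formulas $\Sigma x_1x_2/\Sigma x_2^2$ and $\Sigma x_1x_2/\Sigma x_1^2$ for deviations from the means, and all variable names are illustrative.

```python
from math import sqrt

# Exports (X1) and imports (X2) in billions of dollars, 1932-1941 (Table 28).
exports = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
imports = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]

n = len(exports)
mean1 = sum(exports) / n
mean2 = sum(imports) / n
x1 = [v - mean1 for v in exports]   # deviations of X1 from its mean
x2 = [v - mean2 for v in imports]   # deviations of X2 from its mean

sum_x1x2 = sum(a * b for a, b in zip(x1, x2))
sum_x1sq = sum(a * a for a in x1)
sum_x2sq = sum(b * b for b in x2)

sigma1 = sqrt(sum_x1sq / n)
sigma2 = sqrt(sum_x2sq / n)
r12 = sum_x1x2 / (n * sigma1 * sigma2)

# Slopes straight from least squares (assumed form of the slope equations).
b12_least_squares = sum_x1x2 / sum_x2sq
b21_least_squares = sum_x1x2 / sum_x1sq

# Slopes rebuilt from r and the two standard deviations.
b12_from_r = r12 * sigma1 / sigma2
b21_from_r = r12 * sigma2 / sigma1

print(f"b12: {b12_least_squares:.4f} vs {b12_from_r:.4f}")
print(f"b21: {b21_least_squares:.4f} vs {b21_from_r:.4f}")
```

A by-product of the two formulas is that the product of the two slopes equals $r_{12}^2$, whatever the units of the variables.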

Thus it may be concluded that if the variables are measured in standard-deviation units, the slopes of the lines of regression are the Pearsonian coefficient of correlation. In this light, $r_{12}$ is the change in the average value of $X_1$ expressed in $\sigma_1$ units when $X_2$ changes by one $\sigma_2$ unit. It is also the change in the average value of $X_2$ expressed in $\sigma_2$ units when $X_1$ changes by one $\sigma_1$ unit.

Fig. 112. Diagram showing relationship between lines of regression and the Pearsonian coefficient of correlation r.

This property of $r$ is shown geometrically in Fig. 112. This shows that the slope of the regression of $X_1$ on $X_2$ is $r$, with reference to the $X_2$-axis, and $1/r$ with reference to the $X_1$-axis; that is to say, the line of regression of $X_1$ on $X_2$ makes an angle with the $X_2$-axis whose tangent is $r$ and an angle with the $X_1$-axis whose tangent is $1/r$. The slope of the regression of $X_2$ on $X_1$ is likewise $r$, but with reference to the $X_1$-axis, and $1/r$ with reference to the $X_2$-axis; that is to say, the line of regression of $X_2$ on $X_1$ makes an angle with the $X_1$-axis whose tangent is $r$ and an angle with the $X_2$-axis whose tangent is $1/r$. In other words, in Fig. 112, angle $a$ equals angle $a'$, and angle $b$ equals angle $b'$. All this is on the assumption that the variables are expressed in standard-deviation units as indicated in the equations above.

Thus, in Fig. 112, the tangent of the angle $a$ is $r$, and that of the angle $b$ is $1/r$. When $|a| \le \pi/4$, $|r| = |\tan a| \le 1$. Geometrically, within the limits $|a| \le \pi/4$, $\tan a$ varies between +1 and -1, passing through zero, and $\tan b$ between +1 and -1, passing through infinity. The two lines of regression merge into one line when $r = 1$ (for $\tan a = 1$ when the angle is a 45-degree angle).

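Relationship between r and the First-order Standard Deviation. It will be recalled that the standard deviation of the deviations from the line of regression of $X_1$ on $X_2$ is given by

$$N\sigma_{1.2}^2 = \Sigma X_1^2 - a_{1.2}\Sigma X_1 - b_{12}\Sigma X_1X_2$$

If the variables are measured from their mean values, this becomes

$$N\sigma_{1.2}^2 = \Sigma x_1^2 - b_{12}\Sigma x_1x_2$$

But

$$\Sigma x_1^2 = N\sigma_1^2, \qquad b_{12} = r_{12}\frac{\sigma_1}{\sigma_2}, \qquad \text{and} \qquad \Sigma x_1x_2 = N\sigma_1\sigma_2r_{12}$$

Hence,

$$N\sigma_{1.2}^2 = N\sigma_1^2 - Nr_{12}^2\sigma_1^2$$

and

$$\sigma_{1.2}^2 = \sigma_1^2(1 - r_{12}^2)$$

Finally,

$$\sigma_{1.2} = \sigma_1\sqrt{1 - r_{12}^2} \qquad (16)$$

In the same manner,

$$\sigma_{2.1} = \sigma_2\sqrt{1 - r_{12}^2} \qquad (17)$$

These formulas may also be put in the form

$$r_{12}^2 = 1 - \frac{\sigma_{1.2}^2}{\sigma_1^2} \qquad (18)$$

$$r_{12}^2 = 1 - \frac{\sigma_{2.1}^2}{\sigma_2^2} \qquad (19)$$
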

It will thus be seen that $r$ is closely related to the scatter about the lines of regression. If this scatter is a small percentage of the total scatter, indicating a high degree of representativeness of a line of regression, then $r$ is high. If the scatter is a large percentage of the total variation in the dependent variable, indicating a low degree of representativeness of a line of regression, then the value of $r$ is small. In other words, the better a line of regression fits the data, the higher the value of $r$, and vice versa. The Pearsonian coefficient of correlation is thus a measure of the goodness of fit of the lines of regression.

The Pearsonian Coefficient of Correlation and the Analysis of Variance. For every point on a bivariate scatter diagram such as Fig. 107, there is a corresponding point on the line of regression of $X_1$ on $X_2$. Geometrically this is obtained by projecting the point vertically onto the line of regression (see Fig. 107). Algebraically, the $x_1$ coordinate of a point on the line of regression is found by substituting the given value of $x_2$ in the regression equation $x_1' = r_{12}\dfrac{\sigma_1}{\sigma_2}x_2$.

When the variables are measured from their mean values, the mean of the various values of $x_2$ is zero. Hence the mean of the corresponding values of $x_1'$ is zero also. The standard deviation of these $x_1'$ values is thus

$$\sigma_{1'} = \sqrt{\frac{\Sigma x_1'^2}{N}} = r_{12}\frac{\sigma_1}{\sigma_2}\sqrt{\frac{\Sigma x_2^2}{N}} = r_{12}\sigma_1$$

Equation (16) may thus be written

$$\sigma_{1.2}^2 = \sigma_1^2 - \sigma_{1'}^2$$

or

$$\sigma_1^2 = \sigma_{1'}^2 + \sigma_{1.2}^2 \qquad (20)$$

This says that the total variance of the $X_1$ values is equal to the variance of the corresponding points on the line of regression plus the variance of the deviations from these points. Another way of looking at this is to regard the total variation in $X_1$ as made up of two parts, one consisting of the variation ($\sigma_{1'}^2$) due to its association with $X_2$ as represented by the line of regression, the other representing the variation in $X_1$ due to its association with factors independent of $X_2$ (that is, $\sigma_{1.2}^2$). Similar analysis shows that

$$\sigma_2^2 = \sigma_{2'}^2 + \sigma_{2.1}^2 \qquad (21)$$

in other words, that the total variance in $X_2$ is made up of a part ($\sigma_{2'}^2$) due to its association with $X_1$ as represented by the line of regression of $X_2$ on $X_1$ and a part ($\sigma_{2.1}^2$) due to its association with factors independent of $X_1$ as measured by the deviations from this line of regression.

The formula $\sigma_{1'}^2 = r_{12}^2\sigma_1^2$, which may also be written $r_{12}^2 = \sigma_{1'}^2/\sigma_1^2$, sheds further light on the meaning of $r$. It shows that $r_{12}^2$ measures the proportion of the total variance in $X_1$ that is due to its association with $X_2$. It also measures the proportion of the total variance in $X_2$ that is due to its association with $X_1$.
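
The division of the total variance into two parts can be verified numerically. The sketch below fits the line of regression of exports on imports (Table 28) with deviations measured from the means, then compares the variance of the points on the line and the variance of the deviations from the line with the total variance. The variable names are illustrative assumptions, and all variances divide by $N$.

```python
# Exports (X1) and imports (X2) in billions of dollars, 1932-1941 (Table 28).
exports = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
imports = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]

n = len(exports)
mean1 = sum(exports) / n
mean2 = sum(imports) / n
x1 = [v - mean1 for v in exports]
x2 = [v - mean2 for v in imports]

# Least-squares slope of x1 on x2 (deviations from means, so no intercept).
b12 = sum(a * b for a, b in zip(x1, x2)) / sum(b * b for b in x2)

fitted = [b12 * b for b in x2]                      # points on the line
residuals = [a - f for a, f in zip(x1, fitted)]     # deviations from the line

var_total = sum(a * a for a in x1) / n              # total variance of X1
var_fitted = sum(f * f for f in fitted) / n         # variance of points on the line
var_residual = sum(d * d for d in residuals) / n    # first-order variance

r_squared = var_fitted / var_total

print(f"total    = {var_total:.4f}")
print(f"on line  = {var_fitted:.4f}")
print(f"residual = {var_residual:.4f}")
print(f"r^2      = {r_squared:.4f}")
```

For these data roughly 79 per cent of the variance of exports is associated with imports, the remainder being scatter about the line.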



CHAPTER XIV 


COMPUTATION OF r AND OTHER MEASURES 
OF CORRELATION 

The previous chapter was concerned with an explanation 
of the various devices used to measure the association between 
two variables. This chapter will illustrate their use by carrying 
out a numerical analysis. Only linear correlation will be con- 
sidered here. Measures of nonlinear correlation will be discussed 
in Chap. XV. 

The order of analysis will be first to calculate the correlation coefficient. This will be done for both ungrouped and grouped data, and use will be made of short-cut methods of calculation. For the grouped data, lines of regression will be computed, as well as first-order variances and standard deviations. Reference will again be made to the progressions of means, but the analysis will be continued no further than in the previous chapter.

Computation of r from Ungrouped Data. Since $x = X - \bar{X}$, it follows that

$$\Sigma x_1x_2 = \Sigma(X_1 - \bar{X}_1)(X_2 - \bar{X}_2) = \Sigma X_1X_2 - N\bar{X}_1\bar{X}_2$$

Likewise, $\sigma_1$ and $\sigma_2$ are equal to

$$\sqrt{\frac{\Sigma X_1^2}{N} - \bar{X}_1^2} \qquad \text{and} \qquad \sqrt{\frac{\Sigma X_2^2}{N} - \bar{X}_2^2}$$

Hence the correlation coefficient can be computed from the equation

$$r_{12} = \frac{\Sigma X_1X_2 - N\bar{X}_1\bar{X}_2}{\sqrt{\Sigma X_1^2 - N\bar{X}_1^2}\,\sqrt{\Sigma X_2^2 - N\bar{X}_2^2}} \qquad (1)$$

To illustrate the use of this formula consider again the data on exports and imports of Table 28.* These are reproduced in Table 32, together with the calculations of $\Sigma X_1X_2$, $\bar{X}_1$, $\bar{X}_2$, $\Sigma X_1^2$, and $\Sigma X_2^2$. Three check columns are also employed. The checks are column (1) + column (2) = column (6); column (3) + column (4) = column (7); and column (3) + column (5) = column (8).

* See p. 340.
In the preliminary calculations of Table 32, the last check failed. This showed that a mistake had been made in either column (5) or column (8), for column (3) checked with columns (4) and (7). After some investigation the mistake was found in column (8). By dividing the checks up in this way, an error can be easily located. This sort of check is called a “Charlier check.”

Table 32. Work Sheet for Computing r from Ungrouped Data

   (1)      (2)      (3)       (4)       (5)       (6)         (7)            (8)
   X1       X2      X1X2      X1^2      X2^2     X1 + X2   X1(X1 + X2)   X2(X1 + X2)
   1.6      1.3      2.08      2.56      1.69      2.9         4.64          3.77
   1.7      1.5      2.55      2.89      2.25      3.2         5.44          4.80
   2.1      1.6      3.36      4.41      2.56      3.7         7.77          5.92
   2.3      2.0      4.60      5.29      4.00      4.3         9.89          8.60
   2.5      2.4      6.00      6.25      5.76      4.9        12.25         11.76
   3.3      3.0      9.90     10.89      9.00      6.3        20.79         18.90
   3.1      1.9      5.89      9.61      3.61      5.0        15.50          9.50
   3.2      2.3      7.36     10.24      5.29      5.5        17.60         12.65
   4.0      2.6     10.40     16.00      6.76      6.6        26.40         17.16
   5.1      3.3     16.83     26.01     10.89      8.4        42.84         27.72
Σ = 28.9   21.9     68.97     94.15     51.81     50.8       163.12        120.78

Mean of X1 = 2.89     Mean of X2 = 2.19

Checks:
Σ(1) + Σ(2) = Σ(6):    28.9 + 21.9 = 50.8
Σ(3) + Σ(4) = Σ(7):    68.97 + 94.15 = 163.12
Σ(3) + Σ(5) = Σ(8):    68.97 + 51.81 = 120.78

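From Table 32, $r$ is found according to Eq. (1) to be equal to

$$r_{12} = \frac{68.97 - 10(2.89)(2.19)}{\sqrt{94.15 - 10 \times 2.89^2}\,\sqrt{51.81 - 10 \times 2.19^2}} = \frac{68.97 - 63.291}{\sqrt{94.15 - 83.521}\,\sqrt{51.81 - 47.961}} = \frac{5.679}{\sqrt{10.629}\,\sqrt{3.849}} = \frac{5.679}{(3.26)(1.96)}$$

where $\sigma_1 = \sqrt{1.0629}$ and $\sigma_2 = \sqrt{0.3849}$.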
Table 33. Grades in Second- and First-semester English, 81 Freshmen at Mount Holyoke

(A, B, C, and D grades have been converted to a numerical scale)

Student   Second-semester   First-semester     Student   Second-semester   First-semester
number    grade X1          grade X2           number    grade X1          grade X2
   1         240               220                41        260               260
   2         200               180                42        180               160
   3         260               240                43        100                60
   4         260               260                44        200               220
   5         160               160                45        200               200
   6         240               220                46        160               120
   7         220               200                47        180               160
   8          60               120                48        280               220
   9         220               240                49        200               200
  10         200               180                50        220               220
  11         220               220                51        220               200
  12         140               180                52        240               220
  13         160               120                53        100                60
  14         240               260                54        220               220
  15         260               240                55        240               200
  16         200               160                56        200               220
  17         200               160                57        220               220
  18         240               240                58        220               200
  19         240               220                59        240               200
  20         240               220                60        180               140
  21         160               140                61        160               140
  22         220               240                62        240               220
  23         200               200                63        260               260
  24         100               100                64        160               120
  25         160               140                65        260               240
  26         200               160                66        220               180
  27         180               180                67        220               240
  28         180               160                68        260               260
  29         240               240                69        240               220
  30         200               200                70        200               200
  31         200               200                71        140               120
  32         180               160                72        260               240
  33         260               220                73        200               180
  34         160               120                74        300               280
  35         240               240                75        180               140
  36         220               220                76        220               180
  37         220               240                77        180               180
  38         160               120                78        300               280
  39         200               200                79        220               220
  40         220               220                80        220               200
                                                  81        200               180

This agrees to two decimal places with the previous calculations
of this coefficient made in Chap. XIII. The difference is
due to the different ways of rounding off decimals in making the
calculations.
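The arithmetic for r from the totals of Table 32 can be retraced with a short sketch. It assumes only the sums and means quoted above (N = 10); the variable names are illustrative and are not taken from the text.

```python
import math

# Totals and means quoted from Table 32 (N = 10 pairs of observations).
n = 10
sum_x1x2 = 68.97   # sum of the products X1 * X2
sum_x1sq = 94.15   # sum of X1 squared
sum_x2sq = 51.81   # sum of X2 squared
mean_x1 = 2.89
mean_x2 = 2.19

# Product-moment formula written in terms of raw sums and means.
numerator = sum_x1x2 - n * mean_x1 * mean_x2
ss1 = sum_x1sq - n * mean_x1 ** 2   # sum of squared deviations of X1
ss2 = sum_x2sq - n * mean_x2 ** 2   # sum of squared deviations of X2
r = numerator / math.sqrt(ss1 * ss2)

print(round(numerator, 3), round(ss1, 3), round(ss2, 3), round(r, 3))
```

Carrying more decimals in the square roots than the hand calculation does changes only the later decimal places of r.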

Computation of r from Grouped Data. The Data. The data 
to be used to illustrate the computation of r for grouped data 
are given in Table 33. They may be explained as follows: 

First pair X1, X2. The first pair of observations are the
second-semester and the first-semester English grades, respectively,
of student 1, viz., 240, 220.

Second pair X1, X2. The second pair of observations are the
second-semester and the first-semester English grades, respectively,
of student 2, viz., 200, 180.

Third pair X1, X2. The third pair of observations are the
second-semester and the first-semester English grades, respectively,
of student 3, viz., 260, 240.

The Correlation, or Bivariate Frequency, Table. After the data 
have been tabulated as in Table 33, a correlation table, which 
is in effect a bivariate frequency distribution, is constructed. 
The table is set up with class-interval scales suitable for each
variable,¹ and additional columns and rows are arranged for the
required calculations. In the center of each cell of the correlation
table, frequencies are shown; for example, in the first
column opposite the X1 scale of Table 34, 2 is the frequency of
occurrence of X1 between 100 and 120 and X2 between 60 and 80.
Two students, in other words, have grades in second-semester 
English between 100 and 120 and grades in first-semester English 
between 60 and 80. When all the frequencies are recorded in 
the correlation table, it may be used as a work sheet for the 
calculation of the coefficient of correlation. 

Short Method for Calculating r. Like the standard deviation 
and the mean, it is possible to find r by a short method making 
use of arbitrary origins. 

In the formula for r,

\[ r = \frac{\sum F x_1 x_2}{N \sigma_1 \sigma_2} \qquad (2) \]

σ1 and σ2 may be calculated by the short method that has already
been presented.²

¹ On the question of proper selection of class intervals, see pp. 199-206.
² See pp. 214-215.

It remains only to evaluate ΣFx1x2 in terms
of deviations from the arbitrary origins, A1 and A2, selected
for the respective variables and in terms of the two correction
factors C1 and C2. To do this, note that¹


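\[ x_1 = d_1 - C_1 \qquad (3) \]

where

\[ C_1 = \frac{\sum F d_1}{N} \]

and

\[ x_2 = d_2 - C_2 \qquad (4) \]

where

\[ C_2 = \frac{\sum F d_2}{N} \]

Therefore,

\[ \sum F x_1 x_2 = \sum F (d_1 - C_1)(d_2 - C_2) \]

which expanded is as follows:

\[ \sum F x_1 x_2 = \sum F d_1 d_2 - C_1 \sum F d_2 - C_2 \sum F d_1 + N C_1 C_2 \]

But ΣFd1 = NC1 and ΣFd2 = NC2, and hence

\[ \sum F x_1 x_2 = \sum F d_1 d_2 - N C_1 C_2 \qquad (5) \]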

and accordingly the formula for calculating r by the use of an
arbitrary origin for X1 and an arbitrary origin for X2 is

\[ r = \frac{\sum F d_1 d_2 - N C_1 C_2}{N \sigma_1 \sigma_2} \qquad (6) \]
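The identity behind the short method, that the sum of products of deviations from the means equals the sum of products of deviations from the arbitrary origins less N·C1·C2, can be checked numerically. The sketch below is only an illustration: it uses the first five pairs of Table 33 with every frequency equal to 1, and the origins of 190 are the ones named in the text for Table 34.

```python
# First five (X1, X2) pairs of Table 33, used only as a small illustration.
pairs = [(240, 220), (200, 180), (260, 240), (260, 260), (160, 160)]
a1, a2 = 190, 190          # arbitrary origins

n = len(pairs)
mean1 = sum(x1 for x1, _ in pairs) / n
mean2 = sum(x2 for _, x2 in pairs) / n

# Left side: sum of products of deviations from the two means.
lhs = sum((x1 - mean1) * (x2 - mean2) for x1, x2 in pairs)

# Right side: deviations from the arbitrary origins, then the correction term.
d1 = [x1 - a1 for x1, _ in pairs]
d2 = [x2 - a2 for _, x2 in pairs]
c1 = sum(d1) / n           # correction factor for X1
c2 = sum(d2) / n           # correction factor for X2
rhs = sum(u * v for u, v in zip(d1, d2)) - n * c1 * c2

print(lhs, rhs)
```

Any other pair of origins gives the same right-hand side, which is why the origins may be chosen purely for convenience of computation.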


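Further saving in calculation results, however, if this formula
is put in terms of class-interval units. In other words, the
following form is more conveniently used:²

\[
r_{12} = \frac{\sum F \dfrac{d_1}{i_1}\dfrac{d_2}{i_2} - N\,\dfrac{C_1}{i_1}\,\dfrac{C_2}{i_2}}
              {N\,\dfrac{\sigma_1}{i_1}\,\dfrac{\sigma_2}{i_2}}
\qquad (7)
\]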


¹ When C1 = X̄1 − A1, X̄1 = A1 + C1. By definition x1 = X1 − X̄1 and
d1 = X1 − A1, so that x1 = d1 + A1 − A1 − C1 = d1 − C1.

² The value of the numerator alone is ΣFx1x2; if the problem is one in
multiple correlation, it will be convenient to have a record of this value as
well as the value of r12.

The correlation table serves as a work sheet for the calculation
of the coefficient of correlation, as follows:
1. An arbitrary origin is chosen for each variable; thus, in
Table 34, A1 = 190 and A2 = 190. The arbitrary origins are
taken at the mid-points of a class interval about midway in the
range of the distribution in order to reduce to a minimum the
necessary computations.

2. A column at the side and a row at the bottom of the correlation
table are used to indicate, in class-interval units, the
deviations of each variable from the respective arbitrary origins.
This supplies entries for the rows under the caption d1/i1 and
entries in the columns opposite the stub headings d2/i2. In Table
34, i1 = i2 = 20.

3. The next column at the side and row at the bottom of Table
34 are for the purpose of entering the frequencies multiplied by
the class-interval deviations. The sums of this column and row,
respectively, are used in the calculation of the correction figures
C1/i1 and C2/i2 and in the computation of the means of the
separate frequency distributions. The sums of the columns give
the separate frequency distribution of X2, and the sums of the
rows give the separate frequency distribution of X1.

4. The next column and the next row are for the frequencies 
multiplied by the class-interval deviations squared, in order to 
obtain sums from which to calculate the standard deviations of 
the respective variables. 

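5. The means and standard deviations of the two variables
are calculated as follows:¹

Calculation of the means:

\[ \bar X_1 = 190 + \tfrac{111}{81}(20) = 190 + 27.40740 = 217.4 \]
\[ \bar X_2 = 190 + \tfrac{57}{81}(20) = 190 + 14.074 = 204.1 \]

Calculation of the standard deviations:²

\[ \left(\frac{\sigma_1}{i_1}\right)^2 = \frac{543}{81} - \left(\frac{111}{81}\right)^2 = 4.82579,
   \qquad \frac{\sigma_1}{i_1} = 2.1968, \qquad \sigma_1 = 43.94 \]
\[ \left(\frac{\sigma_2}{i_2}\right)^2 = \frac{493}{81} - \left(\frac{57}{81}\right)^2 = 6.08642 - 0.49519 = 5.59123,
   \qquad \frac{\sigma_2}{i_2} = 2.3646, \qquad \sigma_2 = 47.29 \]

¹ Using Eq. (3), p. 213.
² Using Eq. (6), p. 216.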


6. The product of the deviations from the chosen arbitrary
origins is obtained for each cell in the correlation table. This is
obtained by multiplying the d1/i1 by the d2/i2 corresponding to
the position of that cell. For example, for X1 = 100- and X2 =
60-, the cell in the first column and third row of Table 34, there
is a frequency of 2. According to the chosen arbitrary origins,
this cell has a product deviation (in terms of class-interval units)
of −6 multiplied by −4, or +24. Symbolically, this is
(d1/i1)(d2/i2). The table is divided into four quadrants by lines
through A1 and A2.

Two of these quadrants will have positive product deviations,
and two will have negative product deviations. A product
deviation is entered in each cell that contains frequencies and
appears in the lower right corner of the cell. None are entered
in the first quadrant because no frequencies occur in that quadrant.
Frequencies occur in only one cell in the third quadrant,
that is, in the X1 = 200-, X2 = 160- cell, for which the deviation
product is −1 multiplied by +1, or −1.

7. The product deviation in each cell is multiplied by the
frequency occurring in that cell, in order to obtain the proper
number of product deviations of that particular cell. The
product deviation occurs once in some cases and several times in
others. Obviously, when it occurs several times the sum of the
product deviations is obtained by multiplying by the frequencies.
These figures are entered in each cell in the upper right corner.
Symbolically, they are F(d1/i1)(d2/i2), for each cell.

8. The sum of the figures calculated in item 7 is obtained, that
is, the sum of the product deviations multiplied by their respective
frequencies. This is accomplished by adding the figures
occurring in the upper right corner of each cell by rows and by
columns and adding the sums of the rows or the sums of the
columns to obtain the final sum. If both the latter are computed,
there will be a cross check on addition. Symbolically, the
final aggregate is ΣF(d1/i1)(d2/i2).

9. The coefficient of correlation is calculated by the use of
Eq. (7) shown above, as follows:

Calculation of r:

\[
r = \frac{455 - \dfrac{(111)(57)}{81}}{81(2.19677)(2.36458)}
  = \frac{455 - 78.11111}{420.74964}
  = \frac{376.88889}{420.74964}
  = +0.89576
\]

Lines of Regression and First-order Variances. All the values
that are needed to find the lines of regression of Table 34 have
now been calculated. There are two lines of regression for each
correlation table: the first one represents the regression of X1
on X2 and the second the regression of X2 on X1. Since r has
been computed, the easiest formulas for calculating these two
lines (in original units measured from the intersection of the
means of the two variables as an origin and not in class-interval
units) are as follows:

\[ x_1' = r\,\frac{\sigma_1}{\sigma_2}\,x_2 \qquad\qquad x_2' = r\,\frac{\sigma_2}{\sigma_1}\,x_1 \]

These equations can be expressed in the units of the original
data, that is, the scale as originally formed rather than in
deviations from the means, as follows:

\[ X_1' - \bar X_1 = r\,\frac{\sigma_1}{\sigma_2}\,(X_2 - \bar X_2) \qquad\qquad
   X_2' - \bar X_2 = r\,\frac{\sigma_2}{\sigma_1}\,(X_1 - \bar X_1) \]
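Items 5 and 9 of the work-sheet procedure can be retraced in a few lines. This is a sketch under the assumption that the only inputs are the five work-sheet totals quoted in the text for Table 34 (111, 57, 543, 493, and 455, with N = 81 and class intervals of 20); the variable names are illustrative.

```python
import math

# Work-sheet totals in class-interval units, as quoted in the text for Table 34.
n = 81
i1 = i2 = 20            # class intervals
a1 = a2 = 190           # arbitrary origins
sum_fd1 = 111           # sum of F * (d1/i1)
sum_fd2 = 57            # sum of F * (d2/i2)
sum_fd1_sq = 543        # sum of F * (d1/i1)^2
sum_fd2_sq = 493        # sum of F * (d2/i2)^2
sum_fd1d2 = 455         # sum of F * (d1/i1) * (d2/i2)

# Item 5: means and standard deviations by the short method.
mean1 = a1 + sum_fd1 / n * i1
mean2 = a2 + sum_fd2 / n * i2
s1_units = math.sqrt(sum_fd1_sq / n - (sum_fd1 / n) ** 2)   # sigma1 / i1
s2_units = math.sqrt(sum_fd2_sq / n - (sum_fd2 / n) ** 2)   # sigma2 / i2
sigma1 = s1_units * i1
sigma2 = s2_units * i2

# Item 9: r in class-interval units; N*(C1/i1)*(C2/i2) equals sum_fd1*sum_fd2/n.
r = (sum_fd1d2 - sum_fd1 * sum_fd2 / n) / (n * s1_units * s2_units)

print(round(mean1, 1), round(mean2, 1))
print(round(sigma1, 2), round(sigma2, 2))
print(round(r, 5))
```

Because r is a pure number, the class intervals cancel out of it; they enter only when the means and standard deviations are restored to original units.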

Calculation. For the problem illustrated, the lines of regression
are calculated as

\[ x_1' = 0.89576\,\frac{43.9354}{47.2915}\,x_2 = 0.8322x_2 \qquad\qquad
   x_2' = 0.89576\,\frac{47.2915}{43.9354}\,x_1 = 0.9642x_1 \]

By substituting X1' − X̄1 for x1' and X2 − X̄2 for x2, these
equations may be written as follows:

X1' − 217.4 = 0.8322(X2 − 204.1)
X1' = 0.832X2 + 47.58

X2' − 204.1 = 0.9642(X1 − 217.4)
X2' = 0.964X1 − 5.55

In this form the equations are more easily interpreted as
prediction equations. The first equation says that when a
student has a grade of 100 in first-semester English the predicted
grade in second-semester English is 83.2 + 47.6 = 130.8. The
second equation says that when a student has a grade of 100
in second-semester English the predicted grade in first-semester
English is 96.4 − 5.55 = 90.8.
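The two prediction equations can be reproduced from r, the two standard deviations, and the two means. The sketch below assumes the values computed in the text, carried to one or two more decimals than the rounded 217.4 and 204.1; the function names are hypothetical.

```python
# Values computed in the text for the Mount Holyoke data.
r = 0.89576
sigma1, sigma2 = 43.9354, 47.2915     # standard deviations of X1 and X2
mean1 = 190 + 27.4074                 # mean of X1 (second-semester grade)
mean2 = 190 + 14.074                  # mean of X2 (first-semester grade)

b12 = r * sigma1 / sigma2   # slope of the regression of X1 on X2
b21 = r * sigma2 / sigma1   # slope of the regression of X2 on X1

def predict_x1(x2):
    """Predicted second-semester grade for a given first-semester grade."""
    return mean1 + b12 * (x2 - mean2)

def predict_x2(x1):
    """Predicted first-semester grade for a given second-semester grade."""
    return mean2 + b21 * (x1 - mean1)

print(round(b12, 4), round(b21, 4))
print(round(predict_x1(100), 1), round(predict_x2(100), 1))
```

Working from the deviation form avoids rounding the intercepts first, so the last digit of a prediction may differ slightly from a hand calculation that uses the rounded equations.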

The two lines of regression are shown in Figs. 105 and 106.¹
In Fig. 105 line aa' represents the first line of regression,

X1' = 0.832X2 + 47.58

The small crosses show the location of the means of the columns
(calculated and shown in Table 34). It is to be noted that the
line of regression follows the progression of the means of the
columns.

In Fig. 106, line bb' represents the second line of regression,
X2' = 0.964X1 − 5.55. The small circles show the location of
the means of the rows (calculated and shown in Table 34).
It is to be noted that the line of regression follows the progression
of the means of the rows.

The scatter about each of the lines of regression, the first-order
σ, is calculated by using the following formulas:²

\[ \sigma_{1.2} = \sigma_1\sqrt{1 - r_{12}^2} \qquad\qquad \sigma_{2.1} = \sigma_2\sqrt{1 - r_{12}^2} \]

In the problem illustrated, the first-order variances are
calculated as follows:

\[ \sigma_{1.2} = 43.94(0.44453) = 19.53 \qquad\qquad \sigma_{2.1} = 47.29(0.44453) = 21.02 \]

(When r = 0.89576, √(1 − r²) = 0.44453.)
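Where a table of √(1 − r²) is not at hand, the factor and the two first-order values can be computed directly. A minimal sketch, assuming the rounded r, σ1, and σ2 from the text:

```python
import math

r = 0.89576
sigma1, sigma2 = 43.94, 47.29

k = math.sqrt(1 - r ** 2)     # the factor looked up in the tables
sigma_1_2 = sigma1 * k        # scatter of X1 about its line of regression
sigma_2_1 = sigma2 * k        # scatter of X2 about its line of regression

print(round(k, 4), round(sigma_1_2, 2), round(sigma_2_1, 2))
```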

In Figs. 105 and 106, which show the lines of regression, there
are also shown the limits indicated by the first-order standard
deviations. Between these limits, that is, the line of regression
±σ1.2 for Fig. 105 and the line of regression ±σ2.1 for Fig. 106,
lie roughly two thirds of the frequencies, if it can be assumed
that the population from which the sample is derived is normally
distributed. This gives some idea of how accurate estimates
based upon the lines of regression are likely to be. It is to be
noted that all the means of the columns lie within the limits
described by the first-order standard deviations and that all but
two of the means of the rows lie within these limits. Each of
the latter two means of rows, lying outside these limits (the first
row and the next to the last row), is based upon only one student's
record.

¹ See pp. 328, 329.

² Calculation of √(1 − r²) and 1 − r² is avoided by the use of J. R. Miner,
Tables of √(1 − r²) and 1 − r² for Use in Partial Correlation and in
Trigonometry, or an ordinary table of sines and cosines, since
sin x = √(1 − cos² x).

Progressions of Means. For these data the progressions of 
means have already been discussed in Chap. XIII. Figures 
105 and 106 show the means of the columns and the means of the 
rows plotted from the values computed in Table 34 and reproduced
in Table 26.¹ Figure 105 represents the means of the
vertical frequency distributions of Tables 25 and 34; it gives the
progression of the means of X1 with changing values of X2.
Figure 106 gives a similar analysis for the means of the rows.

¹ See pp. 330 and 358.



CHAPTER XV 

NONLINEAR CORRELATION 


All the foregoing discussion has been concerned with those 
cases in which the progression of the means is linear. In such 
cases it was found that r = Σx1x2/(Nσ1σ2) was an appropriate
measure of correlation. If the progression of means and the
distribution of cases around the means is as pictured in Fig. 113,
however, r may show little correlation, especially in such cases 
as A and C, although there may be a high degree of association 
between the variables. It is the purpose of this chapter to 
indicate ways of describing and measuring such nonlinear 
correlation. 

As indicated in an earlier chapter, the best way of studying 
any correlation is to make a bivariate scatter diagram of the 
data. If the data are numerous enough to be grouped into class 
intervals, then the means of the rows and columns may be 
computed and the variation in the means of each variable with 
changes in the other variable may be studied. 

In the linear case in which a line of regression was used to
measure the association it was found that, the smaller the
scatter, the higher the degree of correlation, the equation being

\[ r^2 = 1 - \frac{\sigma_{1.2}^2}{\sigma_1^2} \]

The same sort of formula may be used to measure the degree
of relationship indicated by the progression in the means. To
distinguish them from the correlation coefficient these measures
are called "correlation ratios" and are defined by the formulas

\[ \eta_{12}^2 = 1 - \frac{\sigma_{c_1}^2}{\sigma_1^2} \qquad\qquad
   \eta_{21}^2 = 1 - \frac{\sigma_{r_2}^2}{\sigma_2^2} \qquad (1) \]

where X̄c represents the means of X1 for various values of X2,
X̄r represents the means of X2 for various values of X1, and
ΣΣ(X1 − X̄c)² and ΣΣ(X2 − X̄r)² represent the sum of the squared
deviations around the means pooled for all the column or row means and

\[ \sigma_{c_1}^2 = \frac{\sum\sum (X_1 - \bar X_c)^2}{N} \qquad\qquad
   \sigma_{r_2}^2 = \frac{\sum\sum (X_2 - \bar X_r)^2}{N} \]

The correlation ratios η12 and η21 give some indication of the
degree to which the means of one variable are successful in
explaining the variation in the other variable. They may be
used for either linear or nonlinear correlation.

If the means of a variable seem to mark off a definite curve
or if in the case of ungrouped data a bivariate chart indicates a
fairly definite type of nonlinear variation, then the average
variation in one variable with changes in the other variable may
be indicated by drawing a smooth curve or fitting one by some
mathematical process such as the method of least squares.
Such a curve is called a curve of regression. A line
of regression on the one hand indicates the average change
in one variable with unit change of the other variable; this
average change is the same for all values of the independent
variable, since the slope of a straight line is constant. A curve
of regression on the other hand gives the average change in one
variable with a unit change in the other variable; but this average
change varies from one value of the independent variable to
another, since the slope of the curve changes at each point. The
technique of fitting curves of regression will be discussed in a
subsequent section.

To measure the accuracy with which a curve of regression
measures the association between two variables, an index of
correlation is defined in a manner similar to the definitions of
r and η. It depends on the closeness with which the various
cases are scattered about the curve and is defined by the formula

\[ I_{12}^2 = 1 - \frac{\sigma_{C_{12}}^2}{\sigma_1^2} \qquad\qquad
   I_{21}^2 = 1 - \frac{\sigma_{C_{21}}^2}{\sigma_2^2} \qquad (2) \]

where C12 and C21 refer to the regression curves, σ²(C12) refers
to the variance of the deviations from the curve of regression of
X1 on X2, and σ²(C21) refers to the variance of the deviations
from the curve of regression of X2 on X1.

Although r12 = r21, the two correlation ratios and the two
indexes of correlation are not necessarily equal; that is,
η12 ≠ η21 and I12 ≠ I21. In addition, η ≥ r.

Since the variance about the means or about a curve is never
greater than the total variance, these formulas always give a
positive value and their square roots are indeterminate as to
sign. The square roots of η² and I² give an index of the degree
of relationship, and the question as to whether it is a positive or
negative relationship must be answered by reference to a table
or a figure showing the polygon or curve of regression. In
the case of curvilinear correlation, the question of positive or
negative relationship often is irrelevant, for two variables
may be positively correlated up to a certain point and then
negatively correlated beyond that point. Generally, it then
becomes necessary to describe the entire relationship. For
example, the death rate due to puerperal […] is interrelated
with ages of the female population in a […] manner. The
relationship between the two is best described by a curve or
polygon of regression, which would have to be shown in its entirety
if the relationship is to be completely described. This is illustrated
in Fig. 55 (page 151). If r were calculated, it
might conceivably be zero when there is actually a close relationship.
An index of such a relationship is given by the calculation
of the η's or the I's.

Calculation of the Correlation Ratio. The calculation of the
correlation ratio will be illustrated by reference to the Mount
Holyoke data in Table 34 (page …). Although the relationship
appears to be linear, it is worth while to compute the correlation
ratio to see how close it comes to r. If the difference is not very
great, the linearity will be numerically illustrated.

Equation (1) for the correlation ratios may be put in the form

\[ \eta_{12} = \sqrt{1 - \frac{\sigma_{c_1}^2}{\sigma_1^2}} \qquad\qquad
   \eta_{21} = \sqrt{1 - \frac{\sigma_{r_2}^2}{\sigma_2^2}} \qquad (3) \]

where σc1 and σr2 are abbreviated expressions for
√(ΣΣ(X1 − X̄c)²/N) and √(ΣΣ(X2 − X̄r)²/N) and
thus represent average standard deviations around
the means of the columns and the means of the rows, respectively.
As explained above, in order to apply these equations for finding
the correlation ratios it is necessary to find the values of σ1, σ2,
σc1, and σr2. This can be most conveniently done with the help of a
work sheet that makes use of arbitrary origins (A1 and A2) and
class-interval deviations (d1/i1 and d2/i2). Such a work sheet is
Table 35, in which the computations are carried out for the data
of Table 34. The algebraic foundation for these computations
is as follows:

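It is assumed that the same A1 is used for every column as
for the total frequency distribution of X1. Then for each
column the sum of the squares of the deviation from the column
mean would be¹

\[
\sum_1^{N_c} (X_1 - \bar X_c)^2
  = i_1^2 \left[ \sum_1^{N_c} F\left(\frac{d_1}{i_1}\right)^2
    - \frac{\left(\sum_1^{N_c} F\,\frac{d_1}{i_1}\right)^2}{N_c} \right]
\]

For the sum of all columns this would be

\[
\sum_1^m \sum_1^{N_c} (X_1 - \bar X_c)^2
  = i_1^2 \left[ \sum_1^m \sum_1^{N_c} F\left(\frac{d_1}{i_1}\right)^2
    - \sum_1^m \frac{\left(\sum_1^{N_c} F\,\frac{d_1}{i_1}\right)^2}{N_c} \right]
\]

But

\[ \sum_1^m \sum_1^{N_c} F\left(\frac{d_1}{i_1}\right)^2 = \sum_1^N F\left(\frac{d_1}{i_1}\right)^2 \]

and by definition ΣΣ(X1 − X̄c)² = Nσ²c1. Therefore,

\[
\frac{N\sigma_{c_1}^2}{i_1^2}
  = \sum_1^N F\left(\frac{d_1}{i_1}\right)^2
    - \sum_1^m \frac{\left(\sum_1^{N_c} F\,\frac{d_1}{i_1}\right)^2}{N_c}
\qquad (4)
\]

It has been determined already that²

\[
\frac{N\sigma_1^2}{i_1^2}
  = \sum_1^N F\left(\frac{d_1}{i_1}\right)^2
    - \frac{\left(\sum_1^N F\,\frac{d_1}{i_1}\right)^2}{N}
\qquad (5)
\]

¹ This follows from Eqs. (1) and (2) of Chap. VII. For it will be noted […]

² See Eqs. (1) and (2) of Chap. VII and previous footnote.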
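Each of the variances in Eq. (3) may be expressed in class-interval
units so that its numerator is the arithmetic difference
between Eqs. (5) and (4) and its denominator is Eq. (5). Thus,

\[
\eta_{12}^2 =
\frac{\displaystyle \sum_1^m \frac{\left(\sum_1^{N_c} F\,\frac{d_1}{i_1}\right)^2}{N_c}
      - \frac{\left(\sum_1^N F\,\frac{d_1}{i_1}\right)^2}{N}}
     {\displaystyle \sum_1^N F\left(\frac{d_1}{i_1}\right)^2
      - \frac{\left(\sum_1^N F\,\frac{d_1}{i_1}\right)^2}{N}}
\qquad (6)
\]

Similarly it can be shown that for a table with l rows

\[
\eta_{21}^2 =
\frac{\displaystyle \sum_1^l \frac{\left(\sum_1^{N_r} F\,\frac{d_2}{i_2}\right)^2}{N_r}
      - \frac{\left(\sum_1^N F\,\frac{d_2}{i_2}\right)^2}{N}}
     {\displaystyle \sum_1^N F\left(\frac{d_2}{i_2}\right)^2
      - \frac{\left(\sum_1^N F\,\frac{d_2}{i_2}\right)^2}{N}}
\qquad (7)
\]

All the items in these two formulas (6) and (7) are to be found
on the work sheet in Table 34 with the exception of

\[
\sum_1^m \frac{\left(\sum_1^{N_c} F\,\frac{d_1}{i_1}\right)^2}{N_c}
\qquad\text{and}\qquad
\sum_1^l \frac{\left(\sum_1^{N_r} F\,\frac{d_2}{i_2}\right)^2}{N_r}
\]

These two figures are obtained from the correlation-ratio work
sheet (Table 35).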

In Table 35, the frequency is placed in large type in the center
of a cell. Each column is now regarded as a separate frequency
distribution whose total number of cases Nc is shown in the
row headed Nc. For each column the same arbitrary origin
(A1 = 190) as that used in Table 34 is used; hence the same
d1/i1 can be used for each column.

For all 11 columns, in the upper right corner of each interval
that contains a frequency is a number in small type representing
the F(d1/i1) for that interval of the column. These are then summed,
giving for each column a ΣF(d1/i1); each of these sums is shown
in the row with the stub title ΣF(d1/i1). If each sum is divided
by the number of cases in the column Nc and multiplied by i1,
the resulting number is the correction factor Cc for that column.
Accordingly, the mean for that column (X̄c) can be found by
using the formula X̄c = A1 + Cc. The results of this calculation
are shown in the row with the stub title ΣF(d1/i1)/Nc and the
column means are shown in the row with the stub heading X̄c.

Table 35. Correlation-ratio Work Sheet

Calculation of η12 and η21 between second-semester (X1) and first-semester (X2) grades of 81 Mount Holyoke freshmen

In order to obtain the figure to be used in the formula for the
correlation ratio (that is, for the square root of η12²) another
row of figures is now added to Table 35; this set of figures
consists of the (ΣF(d1/i1))²/Nc for each column; and when these
are summed for all columns (say for m columns), the resulting
figure is as follows:

\[ \sum_1^m \frac{\left(\sum_1^{N_c} F\,\frac{d_1}{i_1}\right)^2}{N_c} = 473.1215 \]

Using Eq. (6), the correlation ratio of X1 on X2 is thus¹

\[
\eta_{12}^2 = \frac{473.1215 - \dfrac{(111)^2}{81}}{543 - \dfrac{(111)^2}{81}}
            = \frac{473.1215 - 152.1111}{543.0000 - 152.1111}
            = \frac{321.0104}{390.8889}
            = 0.82123
\]
\[ \eta_{12} = 0.9062 \]

¹ The values of ΣF(d1/i1) = 111 and ΣF(d1/i1)² = 543 are found in Table 34.
In that table the same A1 was used for the frequency distribution of X1.
To calculate the means of the rows and the correlation ratio
of X2 on X1, every row of Table 35 is treated as a separate
frequency distribution. The same A2 is used for each row as
the A2 in Table 34 for the entire X2 distribution. Accordingly,
the same set of d2/i2 may be used for each row. The F(d2/i2) for
each interval of each row is placed in small type in the lower
right corner of the interval. These summed for each row give
the ΣF(d2/i2), shown in the column with that title heading. From
these are obtained the Cr for each row, by the same procedure
as that used for finding the column means. For each row, the
(ΣF(d2/i2))²/Nr is then computed and entered in the column
with the title (ΣF(d2/i2))²/Nr. The sum of these for all row frequency
distributions (say l rows) constitutes the aggregate

\[ \sum_1^l \frac{\left(\sum_1^{N_r} F\,\frac{d_2}{i_2}\right)^2}{N_r} = 436.2445 \]

This is the value required by Eq. (7) for finding η21. Thus,¹

\[
\eta_{21}^2 = \frac{436.2445 - \dfrac{(57)^2}{81}}{493 - \dfrac{(57)^2}{81}}
            = \frac{436.2445 - 40.1111}{493.000 - 40.1111}
            = \frac{396.1334}{452.8889}
            = 0.87467
\]
\[ \eta_{21} = 0.9352 \]

¹ The values for ΣF(d2/i2) = 57 and ΣF(d2/i2)² = 493 are found in Table 34.
In that table the A2 is the same as the A2 used in the present table.

The Correlation Ratio and Analysis of Variance. The square
of the correlation ratio is a measure of the proportion of variance
due to correlation, in the same manner as it was indicated that
the square of the coefficient of correlation is a measure of
proportion of variance due to correlation.
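Both correlation ratios can be recomputed directly from the 81 pairs of Table 33, treating each distinct grade as its own class. The sketch below is an illustration rather than the work-sheet method of the text: it forms η² as the variance of the group means (weighted by group size) divided by the total variance, which is unaffected by the choice of origin or class-interval unit. The lists are transcribed from Table 33, with student 9's first-semester grade read as 240 (an assumption, though one that reproduces the work-sheet totals), and the function name is hypothetical.

```python
import math
from collections import defaultdict

# Second-semester (x1) and first-semester (x2) grades, students 1 to 81.
x1 = [240, 200, 260, 260, 160, 240, 220, 60, 220, 200,
      220, 140, 160, 240, 260, 200, 200, 240, 240, 240,
      160, 220, 200, 100, 160, 200, 180, 180, 240, 200,
      200, 180, 260, 160, 240, 220, 220, 160, 200, 220,
      260, 180, 100, 200, 200, 160, 180, 280, 200, 220,
      220, 240, 100, 220, 240, 200, 220, 220, 240, 180,
      160, 240, 260, 160, 260, 220, 220, 260, 240, 200,
      140, 260, 200, 300, 180, 220, 180, 300, 220, 220,
      200]
x2 = [220, 180, 240, 260, 160, 220, 200, 120, 240, 180,
      220, 180, 120, 260, 240, 160, 160, 240, 220, 220,
      140, 240, 200, 100, 140, 160, 180, 160, 240, 200,
      200, 160, 220, 120, 240, 220, 240, 120, 200, 220,
      260, 160, 60, 220, 200, 120, 160, 220, 200, 220,
      200, 220, 60, 220, 200, 220, 220, 200, 200, 140,
      140, 220, 260, 120, 240, 180, 240, 260, 220, 200,
      120, 240, 180, 280, 140, 180, 180, 280, 220, 200,
      180]

def correlation_ratio(dependent, grouping):
    """Return eta for `dependent`, grouped by the values of `grouping`."""
    n = len(dependent)
    grand_mean = sum(dependent) / n
    groups = defaultdict(list)
    for y, g in zip(dependent, grouping):
        groups[g].append(y)
    # Sum of squares of the group means about the grand mean, weighted by size.
    between = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2
                  for v in groups.values())
    total = sum((y - grand_mean) ** 2 for y in dependent)
    return math.sqrt(between / total)

eta_12 = correlation_ratio(x1, x2)   # X1 grouped by columns (values of X2)
eta_21 = correlation_ratio(x2, x1)   # X2 grouped by rows (values of X1)
print(round(eta_12, 4), round(eta_21, 4))
```

Both ratios come out only slightly above r = 0.896, which is the numerical sign of linearity the text refers to.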

As has been explained, when expressed in the form r²σ1² = σ²(1'),
the square of the coefficient of correlation reveals itself as the
proportion of the total variance that is due to correlation or
association with X2 as measured by the line of regression of X1
on X2. In a similar manner, η12²σ1² = σ²(X̄c), and likewise
η21²σ2² = σ²(X̄r).
The square of the correlation ratio thus describes the proportion
of the total variance that is due to correlation as measured by
the fluctuations in the means of the columns and rows. The
standard deviation of the means of the columns squared is the
variance that is due to correlation of X1 with X2 and similarly
for the correlation of X2 with X1.

To demonstrate algebraically that η12²σ1² = σ²(X̄c), it
is necessary first to note that by definition the mean of the
weighted means of the columns equals X̄1. By definition,

\[ \bar X_c = \frac{\sum_1^{N_c} X_1}{N_c} \]

and thus

\[ N_c \bar X_c = \sum_1^{N_c} X_1 \]

which, if summed for all columns, becomes

\[ \sum_1^m N_c \bar X_c = \sum_1^m \sum_1^{N_c} X_1 \qquad (8) \]

But

\[ \sum_1^m \sum_1^{N_c} X_1 = \sum_1^N X_1 \]

and hence, if Eq. (8) is divided by N, it is equivalent to

\[ \frac{\sum_1^m N_c \bar X_c}{N} = \bar X_1 \qquad (9) \]

which was to be proved.

If X̄1, the mean of the entire X1 distribution, is now selected
as the arbitrary origin,

\[ \bar X_c = \bar X_1 + C_c \qquad\text{or}\qquad C_c = \bar X_c - \bar X_1 \]

Also, when the mean of the entire distribution is selected as
the arbitrary origin for each column, the standard deviation
of the column is found by

\[ \sigma_c^2 = \frac{\sum_1^{N_c} x_1^2}{N_c} - C_c^2 \qquad [x_1^2 = (X_1 - \bar X_1)^2] \]

On substituting Cc = X̄c − X̄1 and transposing, an expression
for each column similar to the following will result:

\[ \text{(a)} \qquad \frac{\sum_1^{N_c} x_1^2}{N_c} = \sigma_c^2 + (\bar X_c - \bar X_1)^2 \]

Multiplying the equation for each column by its Nc, respectively,
will result for each column in

\[ \text{(b)} \qquad \sum_1^{N_c} x_1^2 = N_c\sigma_c^2 + N_c(\bar X_c - \bar X_1)^2 \]

When the whole series, one for each column, of equations such as
(b) are totaled, the following result is obtained:

\[ \text{(c)} \qquad \sum_1^m \sum_1^{N_c} x_1^2 = \sum_1^m N_c\sigma_c^2 + \sum_1^m N_c(\bar X_c - \bar X_1)^2 \]

But, in this equation,

\[ \sum_1^m \sum_1^{N_c} x_1^2 = \sum_1^N x_1^2 = N\sigma_1^2 \]

and

\[ \sum_1^m N_c\sigma_c^2 = N\sigma_{c_1}^2 \]

Moreover, by definition, the explained variance, that is to say,
the variance of the means of the columns about the weighted
mean of these means, is as follows:

\[ N\sigma_{\bar X_c}^2 = \sum_1^m N_c(\bar X_c - \bar X_1)^2 \]

Consequently, (c) may be written

\[ N\sigma_1^2 = N\sigma_{c_1}^2 + N\sigma_{\bar X_c}^2 \qquad\text{or}\qquad \sigma_1^2 = \sigma_{c_1}^2 + \sigma_{\bar X_c}^2 \]

Substituting the value of σ²c1 = σ1² − σ²(X̄c) in Eq. (3) for the
correlation ratio gives the following:

\[ \eta_{12}^2 = \frac{\sigma_{\bar X_c}^2}{\sigma_1^2} \qquad\text{or}\qquad \sigma_{\bar X_c}^2 = \eta_{12}^2\,\sigma_1^2 \qquad (10) \]

Similarly, it can be shown that

\[ \eta_{21}^2 = \frac{\sigma_{\bar X_r}^2}{\sigma_2^2} \qquad\text{or}\qquad \sigma_{\bar X_r}^2 = \eta_{21}^2\,\sigma_2^2 \qquad (11) \]
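The splitting of the total variance into a part within the columns and a part between the column means can be seen on a small numerical case. The three columns below are hypothetical values chosen for easy arithmetic; they are not data from the text.

```python
# Hypothetical columns of X1 values (not from the text), one list per X2 class.
columns = [[1.0, 2.0, 3.0], [2.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

values = [v for col in columns for v in col]
n = len(values)
grand_mean = sum(values) / n

# Total variance of X1 about its own mean.
total = sum((v - grand_mean) ** 2 for v in values) / n

# Pooled variance about the column means (the unexplained part).
within = sum((v - sum(col) / len(col)) ** 2
             for col in columns for v in col) / n

# Variance of the column means about the grand mean, weighted by column size.
between = sum(len(col) * (sum(col) / len(col) - grand_mean) ** 2
              for col in columns) / n

eta_squared = between / total
print(round(total, 5), round(within, 5), round(between, 5), round(eta_squared, 5))
```

The first printed figure equals the sum of the second and third, and the last is the share of the total variance carried by the column means.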


Calculation of Curvilinear Regression. To illustrate the statistical
problem involved in curvilinear regression and the calculation
of the correlation index I, data on cotton stocks, production,
and imports compared with cotton prices, 1920-1939, have been
selected. They are shown in Table 36 and plotted in Fig. 114.

Table 36. Stocks, Production, and Imports of Cotton and Price
of Cotton Received by Producers in the United States

Stocks at beginning of crop year plus year's production plus net imports.
Prices are deflated by United States index of wholesale prices for crop years.

Year beginning    Deflated average price,    Stocks, production, and imports,
Aug. 1            cents per pound            ten million bales
                  X1                         X2

1920-1921             13.47                      1.726
1921-1922             18.06                      1.480
1922-1923             22.63                      1.306
1923-1924             29.30                      1.274
1924-1925             22.63                      1.550
1925-1926             19.19                      1.805
1926-1927             12.92                      2.195
1927-1928             20.95                      1.711
1928-1929             18.71                      1.749
1929-1930             18.34                      1.755
1930-1931             12.13                      1.862
1931-1932              8.38                      2.378
1932-1933             10.30                      2.307
1933-1934             14.04                      2.162
1934-1935             15.76                      1.765
1935-1936             13.83                      1.815
1936-1937             14.48                      1.821
1937-1938             10.29                      2.382
1938-1939             11.17                      2.383

The position of plotted bivariates in Fig. 114 suggests that a 
curve such as aa' might fit the data. The question of the type of 
curve fitted is of particular importance in curvilinear regression 
and accordingly three types will be discussed for illustrative 
purposes. 



Fig. 114. Bivariate scatter diagram and fitted curve showing relationship
between the price of cotton and the supply of cotton.

Logarithmic Regression. The constant slope of a straight
line depicts the fact that the change in X1 is constant for a
given quantity of change in X2, and vice versa. The changing
slope of a curve depicts the fact that change in X1 varies for
different values of X2, and vice versa. One such curvilinear
relationship between X1 and X2 is as follows:

\[ X_1 X_2^{\,b} = k \qquad (12) \]

In Eq. (12) the varying manner in which X1 fluctuates with
respect to X2 depends on the exponent b. If b is larger than 1,
a small change in X2 must produce a large change in X1 because,
as the equation indicates, their product (when X2 is raised to
the b power) is constant. If b is equal to 1, the changes in X1
must be just proportionate (in an inverse manner) to the changes
in X2. If b is less than 1, the changes in X1 must be proportionately
less than the changes in X2. If such an equation is used
to describe the line of regression of price of cotton on stocks
and production of cotton, a very flexible price of cotton will
result in a value of b larger than 1; a very inflexible price of
cotton will result in a value of b less than 1. The nature of
Eq. (12) assumes that the flexibility in the price of cotton remains
the same regardless of stocks and production, because it sets up
the hypothesis that the product equals a constant.

Fig. 115. The relationship of Fig. 114 in logarithmic form.

If such an equation is assumed to be suitable for the problem
in hand, the fitting of the curve of regression may be simplified
by first transforming the equation to its logarithmic form, namely,

\[ \log X_1 + b \log X_2 = \log k, \qquad \text{where } \log k = a \qquad (13) \]
Table 37. Logarithms of United States Production, Stocks, and
Imports of Cotton and of the Price of Cotton Received by Producers

With columns for the squares of the logarithms and their cross products
X1 = price of cotton
X2 = stocks, production, and imports of cotton

    log X1     log X2     log X1 log X2     log² X1     log² X2

    1.1294     0.2370        0.2677          1.2755      0.0562
    1.2567     0.1703        0.2140          1.5793      0.0290
    1.3547     0.1159        0.1570          1.8352      0.0134
    1.4669     0.1052        0.1543          2.1518      0.0111
    1.3547     0.1903        0.2578          1.8352      0.0362
    1.2831     0.2565        0.3291          1.6464      0.0658
    1.1113     0.3414        0.3794          1.2350      0.1166
    1.3212     0.2333        0.3082          1.7456      0.0544
    1.2721     0.2428        0.3089          1.6182      0.0590
    1.2634     0.2443        0.3086          1.5962      0.0597
    1.0839     0.2700        0.2927          1.1748      0.0729
    0.9232     0.3762        0.3473          0.8523      0.1415
    1.0128     0.3631        0.3677          1.0258      0.1318
    1.1474     0.3349        0.3843          1.3165      0.1122
    1.1976     0.2467        0.2954          1.4343      0.0609
    1.1408     0.2589        0.2954          1.3014      0.0670
    1.1608     0.2603        0.3022          1.3475      0.0678
    1.0124     0.3769        0.3816          1.0250      0.1421
    1.0481     0.3771        0.3952          1.0985      0.1422

Σ = 22.5405    5.0011        5.7468         27.0945      1.4398

Figure 115 shows the effect of transforming the bivariate frequency
distribution from original units to logarithmic units.
The data plotted are the same as the data plotted in Fig. 114,
except that, in Fig. 115, the X1 and X2 scales refer to the logarithms
of X1 and X2. When the bivariate logarithms shown in
the first two columns of Table 37 are plotted in Fig. 115, a straight
line fits the points. Thus the logarithmic transformation has
converted a curvilinear correlation problem into a simple linear
correlation problem in which the Pearsonian coefficient of
correlation is r(log X1, log X2) and the line of regression of log X1 on
log X2 is as follows:

\[
\log X_1 - \text{mean of } \log X_1
  = r_{\log X_1 \log X_2}\,\frac{\sigma_{\log X_1}}{\sigma_{\log X_2}}\,
    (\log X_2 - \text{mean of } \log X_2)
\]
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The equations of regression could be obtained in the above form 
and then transformed into their antilogarithmic form; but in 
this problem it is more convenient to find the regression equation 
directly from the least-squares equations. Accordingly, the 
regression statistics o and h of Eq. (13) may be calculated by 
using the following least-squares equations:^ 

S log Xi = Na + log 
2 log Xi log Xi = o2 log Xi + i>2 log* Xj 

Table 37 is a work sheet providing columns to calculate 2 log Xi, 
2 log Xi, 2 log Xi log Xj, 2 log® Xj, and 2 log® Xj, using the 
data of the cotton problem for which the raw data are found in 
Table 36. The first two columns of Table 37 show the logarithms 
of the price of cotton in the United States and of the stocks, pro- 
duction, and imports of cotton. The third column contains the 
cross products of the logarithms. The fourth and fifth columns 
contain the squares of the logapthms in the first two columns. 
The sums of the columns provide the values that are required to 
find the regression statistics o and b, for Eq. (13). 

Calculation of the regression of $\log X_1$ on $\log X_2$:

$$22.5405 = 19a + 5.0011b$$
$$5.7468 = 5.0011a + 1.4398b$$

In order to solve, eliminate $a$ by multiplying the second equation
by 3.7992 and subtracting it from the first, as follows:

$$22.5405 = 19a + 5.0011b$$
$$21.8332 = 19a + 5.4701b$$
$$0.7073 = -0.4690b$$
$$b = -1.5081$$

Substituting this value of $b$ in either of the equations will show
that

$$a = 1.5833$$

Accordingly, the equation of logarithmic regression of $\log X_1$
on $\log X_2$ is as follows:

$$\log X_1 = 1.5833 - 1.5081 \log X_2$$

which may be transformed into antilogarithmic form as follows:

$$X_1 X_2^{1.5081} = 38.31$$
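
The elimination above can be retraced mechanically. The following is a minimal sketch, assuming only the column totals of Table 37 with N = 19; the function name `solve_two_normal_equations` is illustrative and not part of the text. Because the hand calculation rounds the multiplier to 3.7992, the machine result should agree with the printed figures only to about three decimals.

```python
def solve_two_normal_equations(n, sum_x, sum_y, sum_xy, sum_xx):
    """Solve  sum_y = n*a + b*sum_x  and  sum_xy = a*sum_x + b*sum_xx."""
    # Scale the second equation so that its coefficient of a equals n,
    # then subtract it from the first equation to eliminate a.
    k = n / sum_x
    b = (sum_y - k * sum_xy) / (sum_x - k * sum_xx)
    a = (sum_y - b * sum_x) / n
    return a, b


# Totals of Table 37, with x = log X2 and y = log X1.
a, b = solve_two_normal_equations(
    n=19, sum_x=5.0011, sum_y=22.5405, sum_xy=5.7468, sum_xx=1.4398
)

# Antilogarithmic form: X1 * X2**(-b) = 10**a
constant = 10 ** a
print(round(a, 4), round(b, 4), round(constant, 2))
```

The constant of the antilogarithmic form is simply the antilogarithm of $a$, and the exponent of $X_2$ is $-b$.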

¹ See p. 333.

Reciprocal Regression. Reciprocal regression is a special
form of the type of regression indicated by Eq. (12); for if $b = 1$,
changes in $X_1$ are related reciprocally to changes in $X_2$. In
other words, the equation becomes

$$X_1X_2 = k \quad\text{or}\quad \frac{1}{X_1} = k'X_2 \tag{14}$$

which, placed in a more general form, is as follows:

$$\frac{1}{X_1} = a + bX_2 \tag{15}$$

If the reciprocal of each $X_1$ is found, it is possible to find the
equation for the reciprocal regression by fitting a straight line
to $X_2$ and the reciprocal of $X_1$, that is to say, by fitting an
equation such as (15). Figure 116 shows the effect of transforming
one of the variables of the bivariate frequency distribution from
original units to reciprocal units. In the figure the vertical
scale is $1/X_1$ while the horizontal scale remains $X_2$. When the
bivariates shown in Table 38 are plotted in Fig. 116, a straight
line fits the points. Thus the reciprocal transformation has
converted a problem in curvilinear correlation into a problem
in simple linear correlation in which the Pearsonian coefficient
of correlation is $r_{(1/X_1)X_2}$ and the line of regression is as follows:

$$\frac{1}{X_1} - \text{mean of } \frac{1}{X_1} = r_{(1/X_1)X_2}\,\frac{\sigma_{1/X_1}}{\sigma_2}\,(X_2 - \bar{X}_2)$$

Fig. 116. The relationship of Fig. 114 in reciprocal form.



Table 38. United States Supply of Cotton and the Reciprocal of
the Price of Cotton Received by Producers
With columns for the squares and the cross products
X1 = price of cotton
X2 = supply of cotton

        X2        1/X1       X2/X1       X2²       1/X1²
       1.726    0.07424    0.12814    2.97908    0.00551
       1.480    0.05537    0.08195    2.19040    0.00307
       1.306    0.04419    0.05771    1.70564    0.00195
       1.274    0.03413    0.04348    1.62308    0.00116
       1.550    0.04419    0.06849    2.40250    0.00195
       1.805    0.05211    0.09406    3.25803    0.00272
       2.195    0.07740    0.16989    4.81803    0.00599
       1.711    0.04773    0.08167    2.92752    0.00228
       1.749    0.05345    0.09348    3.05900    0.00286
       1.755    0.05453    0.09570    3.08003    0.00297
       1.862    0.08244    0.15350    3.46704    0.00680
       2.378    0.11933    0.28377    5.65488    0.01424
       2.307    0.09709    0.22399    5.32225    0.00943
       2.162    0.07123    0.15400    4.67424    0.00507
       1.765    0.06345    0.11199    3.11523    0.00403
       1.815    0.07231    0.13124    3.29423    0.00523
       1.821    0.06906    0.12576    3.31604    0.00477
       2.382    0.09718    0.23148    5.67392    0.00944
       2.383    0.08953    0.21335    5.67869    0.00802
 Σ =  35.426    1.29896    2.54365   68.23983    0.09749



The equations of regression could be obtained in the above form
and then transformed into their original units, but in this
problem it is more convenient to find the regression equation
directly from the least-squares equations. The normal
least-squares equations are as follows:

$$\Sigma \frac{1}{X_1} = Na + b\,\Sigma X_2$$
$$\Sigma \frac{X_2}{X_1} = a\,\Sigma X_2 + b\,\Sigma X_2^2$$

Table 38 is a work sheet with columns in which the required sums
are obtained. Entering these sums in the above least-squares
equations makes it possible to evaluate the regression statistics
$a$ and $b$ for Eq. (15).

Calculation of the regression of $1/X_1$ on $X_2$:

$$1.29896 = 19a + 35.4260b$$
$$2.54365 = 35.4260a + 68.23983b$$

Multiplying the first equation by 1.8645263 and subtracting the
result from the second equation eliminates $a$ and gives a solution
for $b$ as follows:

$$b = 0.05564$$

Substituting this value in either equation gives the solution of $a$
as follows:

$$a = -0.03538$$

The equation of regression is therefore as follows:

$$\frac{1}{X_1} = -0.03538 + 0.05564X_2$$

This equation describes the straight line plotted in Fig. 116.
Plotted on scales of $X_1$ and $X_2$, the equation is a curve.
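
The same fit can be reproduced from the first two columns of Table 38 rather than from the printed totals. The sketch below is an illustration under that assumption; the names `supply`, `reciprocal_price`, and `estimate_price` are hypothetical. It forms the four sums, solves the two normal equations in closed form, and converts an estimate back to price units. Results may differ from the printed $a$ and $b$ in the last decimal because the work sheet rounds each cross product.

```python
# Columns X2 and 1/X1 copied from Table 38.
supply = [1.726, 1.480, 1.306, 1.274, 1.550, 1.805, 2.195, 1.711, 1.749,
          1.755, 1.862, 2.378, 2.307, 2.162, 1.765, 1.815, 1.821, 2.382,
          2.383]
reciprocal_price = [0.07424, 0.05537, 0.04419, 0.03413, 0.04419, 0.05211,
                    0.07740, 0.04773, 0.05345, 0.05453, 0.08244, 0.11933,
                    0.09709, 0.07123, 0.06345, 0.07231, 0.06906, 0.09718,
                    0.08953]

n = len(supply)
sum_x = sum(supply)
sum_y = sum(reciprocal_price)
sum_xy = sum(x * y for x, y in zip(supply, reciprocal_price))
sum_xx = sum(x * x for x in supply)

# Closed-form solution of the two normal equations for 1/X1 = a + b*X2.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = (sum_y - b * sum_x) / n


def estimate_price(x2):
    """Estimate X1 by inverting the fitted reciprocal 1/X1 = a + b*x2."""
    return 1.0 / (a + b * x2)


print(round(a, 5), round(b, 5), round(estimate_price(2.5), 2))
```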

Parabolic Regression. The curvilinear relationships so far 
considered have been relationships that could readily be trans- 
formed to a linear form, by taking logarithms or reciprocals. 
Such transformations reduced the problem to one of simple 
linear correlation between the transformed variables, and there 
was little in the analysis that was different from that of the 
previous chapters. A curve that cannot easily be transformed 
to a linear form is the parabolic relationship

$$X_1' = a + b_1X_2 + b_2X_2^2 \tag{16}$$

This must be fitted directly. Fortunately, the nature of
the curve is such that the method of least squares can be used.
According to this, to fit a parabolic regression the least-squares
equations are obtained as follows:

The least-squares criterion is that

$$\Sigma d^2 = \Sigma (X_1 - X_1')^2 = \text{minimum}$$

or

$$\Sigma (X_1 - a - b_1X_2 - b_2X_2^2)^2 = \text{minimum}$$

For this to be a minimum its total differential should be equal
to zero; that is, differentiating with respect to $a$, $b_1$, and $b_2$ and
setting equal to zero, the following normal equations are obtained:

$$\Sigma X_1 = Na + b_1\Sigma X_2 + b_2\Sigma X_2^2$$
$$\Sigma X_1X_2 = a\Sigma X_2 + b_1\Sigma X_2^2 + b_2\Sigma X_2^3$$
$$\Sigma X_1X_2^2 = a\Sigma X_2^2 + b_1\Sigma X_2^3 + b_2\Sigma X_2^4$$

Table 39 is a work sheet providing for the calculation and
checking of the sums entering into the three parabolic equations
of regression. Using the sums of the appropriate columns
the following set of equations is obtained for the calculation of
the regression statistics $a$, $b_1$, and $b_2$ for the regression of
$X_1$ on $X_2$, shown in Eq. (16):

$$306.58 = 19a + 35.426b_1 + 68.2398b_2 \tag{I}$$
$$542.7359 = 35.426a + 68.2398b_1 + 135.4744b_2 \tag{II}$$
$$994.4092 = 68.2398a + 135.4744b_1 + 276.3974b_2 \tag{III}$$

The solution of three equations for three unknowns should be
undertaken in an orderly manner; this is attempted in Table 40,
which is a work sheet following the so-called Doolittle method.
This work sheet provides a step-by-step check on the calculations
as the solution of the equations proceeds. In order to avoid
copying $a$, $b_1$, and $b_2$ each time an equation is written down,
$a$, $b_1$, and $b_2$ are written as the titles of columns in which their
coefficients are entered. In the table only the coefficients are entered
in their respective columns with the proper sign before each figure.
For example, row (1) of the table is presumed to read as follows:

$$19a + 35.426b_1 + 68.2398b_2 - 306.58 = 0$$

which is the first equation above with slight rearrangement of
terms.



Table 39. United States Supply of Cotton and the Price of Cotton Received by Producers
With columns for the second, third, and fourth powers and for the necessary cross products to fit parabolic regressions

X1 = price of cotton
X2 = supply of cotton

Table 40. Doolittle Work Sheet for Calculating Three Regression Statistics for Curvilinear Correlation

Regression of X1 on X2

* 8.2395 = 135.4744 - 127.2349.

Row (11) is used to find $b_2$; row (6) is used to find $b_1$ after $b_2$ is found; row (2) is used to find $a$ after $b_2$ and $b_1$ are found, from the equations

Row (11), $-1.0000b_2 + 7.9944 = 0$
Row (6), $-1.0000b_1 - 3.76733b_2 - 13.2095 = 0$
Row (2), $-1.0000a - 1.864526b_1 - 3.59157b_2 + 16.1358 = 0$

Three steps are involved in solving three equations for three
unknowns: (1) to get an equation in the three unknowns in which
the coefficient of one of the unknowns is unity, (2) to get an
equation in only two of the unknowns in which the coefficient
of one of the two is unity, and (3) to get an equation in only one
of the unknowns in which its coefficient is unity. When the
third step is accomplished, the value of the third unknown is
obtained. This value, applied in the equation obtained by the
second step, makes it possible to evaluate the second unknown;
and the remaining unknown is then obtained by applying these two
values in the equation obtained by the first step. This is the
same process as that used for finding two unknowns from two
equations.

Table 40 provides an orderly procedure and also a check
for these steps. The first step is accomplished in row (2) of
the table, by multiplying Eq. (I), copied in row (1), by the
negative reciprocal of the coefficient of $a$, that is, by $-\frac{1}{19}$; this
will make the coefficient of $a$ become $-1$. The second step,
rows (3) to (6), eliminates $a$ from two of the equations in order
to obtain in row (5) an equation in $b_1$ and $b_2$. In order to
eliminate $a$, the first equation must be divided by its own
coefficient of $a$ and multiplied by the coefficient of $a$ of Eq. (II); in
other words, Eq. (I) must be multiplied by $-\frac{35.426}{19}$. The
multiplier is given a negative sign so that, when added to Eq. (II),
the $a$ term will cancel. Row (6) multiplies row (5) by the negative
reciprocal of the coefficient of $b_1$ in row (5). The third step, rows
(7) to (11), accomplishes the elimination of two of the variables,
ending with an equation in only one of them, which of course gives
its value. In order to do this, Eq. (III) is copied in row (7);
Eq. (I) is multiplied by a number that will give it a coefficient
of $a$ equal to the coefficient of $a$ of Eq. (III), that is, by $-\frac{68.2398}{19}$,
and this is entered in row (8); then the equation obtained in
row (6) (in terms of only $b_1$ and $b_2$, $a$ having been eliminated) is
multiplied by a number that, combined with the two coefficients
of $b_1$ in rows (7) and (8), will give a sum of zero. The sum of
rows (7) to (9) will then eliminate both $a$ and $b_1$, giving in row (10)
such an equation. When row (10) is multiplied by the negative
reciprocal of its coefficient of $b_2$, the value of $b_2$ is obtained; this
is shown in row (11).

A column for sums is provided in order to obtain a step-by-step
check on all calculations. This is done by applying to
the sums the same multipliers as those applied to the equations.
For example, the sum of row (1) multiplied by $-\frac{1}{19}$ should equal
the sum of row (2). In the column headed Checks are entered
the products obtained by multiplying the sums as indicated
under Remarks to visualize the checks.

From Table 40, the values of $a$, $b_1$, and $b_2$ are obtained from
equations in rows (2), (6), and (11), as follows:

Row (2), $-a - 1.864526b_1 - 3.59157b_2 + 16.1358 = 0$
Row (6), $-b_1 - 3.76733b_2 - 13.2095 = 0$
Row (11), $-b_2 + 7.9944 = 0$

$$b_2 = 7.9944$$
$$b_1 = -3.76733(7.9944) - 13.2095 = -43.327$$
$$a = 43.327(1.864526) - 7.9944(3.59157) + 16.135789 = 68.2077$$

The equation of regression of $X_1$ on $X_2$ is therefore as follows:

$$X_1 = 68.2077 - 43.327X_2 + 7.9944X_2^2$$
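
The idea of carrying a sums column through the elimination can be sketched in code. What follows is ordinary forward elimination with a check column, written in the spirit of the Doolittle work sheet rather than as a reproduction of Table 40; the function name `eliminate_with_check` and its layout are assumptions made for illustration. Each row carries the sum of its own entries, and after every operation that carried sum must still agree with the recomputed sum. Because the third pivot is a small difference of large numbers, the machine solution can be expected to match the hand-computed values only to about three significant figures.

```python
def eliminate_with_check(rows):
    """Solve rows of the form [coefficients..., constant] for the unknowns.

    Every row carries one extra entry, the sum of its other entries. Each
    row operation is applied to that entry too, so a slip in arithmetic
    would show up as a disagreement in the check column.
    """
    m = [row + [sum(row)] for row in rows]          # append the check column
    n = len(m)
    for i in range(n):
        pivot = m[i][i]
        m[i] = [v / pivot for v in m[i]]            # leading coefficient -> 1
        for j in range(i + 1, n):
            factor = m[j][i]
            m[j] = [vj - factor * vi for vj, vi in zip(m[j], m[i])]
        for row in m:
            assert abs(sum(row[:-1]) - row[-1]) < 1e-6, "check column disagrees"
    # Back substitution, last unknown first.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))
    return x


normal_equations = [
    [19.0,     35.426,   68.2398,  306.58],     # Eq. (I)
    [35.426,   68.2398,  135.4744, 542.7359],   # Eq. (II)
    [68.2398,  135.4744, 276.3974, 994.4092],   # Eq. (III)
]
a, b1, b2 = eliminate_with_check(normal_equations)
print(round(a, 2), round(b1, 2), round(b2, 2))
```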

Estimates Based on Regression Equations. Using the three
equations of regression calculated above for the regression of
$X_1$ on $X_2$, that is, for the regression of the price of cotton on
production, stocks, and imports of cotton in the United States,
estimates may be made of the price that will result from a given
volume of stocks plus production plus imports. The three
equations are as follows:

Logarithmic regression, $\log X_1 = 1.5833 - 1.5081 \log X_2$
Reciprocal regression, $\frac{1}{X_1} = -0.03538 + 0.05564X_2$
Parabolic regression, $X_1 = 68.2077 - 43.327X_2 + 7.9944X_2^2$

To illustrate the method of estimation, suppose the questions
are asked: What is the expected price of cotton if the cotton
stocks plus the year's production and imports amount to 25 million
bales? What is the expected price of cotton if the cotton
stocks plus the year's production and imports amount to 22
million bales? 19 million bales? 16 million bales? 13 million
bales? Only 10 million bales? How much higher will the
price be in a year of shortage than in a year of large carry-over
and large production and imports of cotton? Table 41 shows
how these estimates are made by using the above three equations
of regression. When the year's cotton stocks, production, and
imports are large, the three regression equations give results
that are approximately equal to each other; but when the year's
cotton stocks, production, and imports are small, the estimates
based upon the three regression equations differ sharply from
each other.

Table 41. Estimates of Cotton Prices Based on Three Regression
Curves

Estimates based on logarithmic regression

 Values               Equation of estimate                          Estimate
 of X2     log X2     1.5833 - 1.5081 log X2 = log X1     log X1    of X1
  2.5      0.39794    1.5833 - 1.5081(0.39794) =          0.98317     9.62
  2.2      0.34242    1.5833 - 1.5081(0.34242) =          1.06690    11.67
  1.9      0.27875    1.5833 - 1.5081(0.27875) =          1.16292    14.55
  1.6      0.20412    1.5833 - 1.5081(0.20412) =          1.27547    18.86
  1.3      0.11394    1.5833 - 1.5081(0.11394) =          1.46612    29.25
  1.0      0.00000    1.5833 - 1.5081(0.00000) =          1.58330    38.31

Estimates based on reciprocal regression

 Values    Equation of estimate                          Estimate
 of X2     -0.03538 + 0.05564 X2 = 1/X1         1/X1     of X1
  2.5      -0.03538 + 0.05564(2.5) =          0.10372      9.64
  2.2      -0.03538 + 0.05564(2.2) =          0.08703     11.49
  1.9      -0.03538 + 0.05564(1.9) =          0.07034     14.22
  1.6      -0.03538 + 0.05564(1.6) =          0.05364     18.64
  1.3      -0.03538 + 0.05564(1.3) =          0.03695     27.06
  1.0      -0.03538 + 0.05564(1.0) =          0.02026     49.36

Estimates based on parabolic regression

 Values    Equation of estimate                            Estimate
 of X2     68.2077 - 43.327 X2 + 7.9944 X2² = X1           of X1
  2.5      68.2077 - 43.327(2.5) + 7.9944(6.25) =            9.85
  2.2      68.2077 - 43.327(2.2) + 7.9944(4.84) =           11.58
  1.9      68.2077 - 43.327(1.9) + 7.9944(3.61) =           14.75
  1.6      68.2077 - 43.327(1.6) + 7.9944(2.56) =           19.35
  1.3      68.2077 - 43.327(1.3) + 7.9944(1.69) =           25.39
  1.0      68.2077 - 43.327(1.0) + 7.9944(1.00) =           32.87
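
The three estimating equations can be evaluated side by side. The sketch below is illustrative only: the function names are hypothetical, and it assumes, as the examples in the text imply, that $X_2$ is expressed in tens of millions of bales, so that 25 million bales corresponds to $X_2 = 2.5$. It evaluates the curves at a large supply and at a small supply to show how they agree at one end and diverge at the other.

```python
import math


def price_logarithmic(x2):
    """log X1 = 1.5833 - 1.5081 log X2, returned in price units."""
    return 10 ** (1.5833 - 1.5081 * math.log10(x2))


def price_reciprocal(x2):
    """1/X1 = -0.03538 + 0.05564 X2, returned in price units."""
    return 1.0 / (-0.03538 + 0.05564 * x2)


def price_parabolic(x2):
    """X1 = 68.2077 - 43.327 X2 + 7.9944 X2**2."""
    return 68.2077 - 43.327 * x2 + 7.9944 * x2 ** 2


for x2 in (2.5, 1.0):   # a year of large supply, then a year of shortage
    estimates = (price_logarithmic(x2), price_reciprocal(x2), price_parabolic(x2))
    spread = max(estimates) - min(estimates)
    print(x2, [round(e, 2) for e in estimates], "spread:", round(spread, 2))
```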

First-order Standard Deviation Used as Standard Error of
Estimate. The dispersion about a curve of regression can be
measured in the same manner as the dispersion of cases about a
progression of means or a line of regression. The measure
generally used is the standard deviation and is called a "first-order
standard deviation" or a "standard error of estimate,"
because it is the standard deviation of the residuals about the
curves of regression by means of which estimates such as those
illustrated in Table 41 are made.

For the illustration in which cotton stocks, production, and
imports are correlated with cotton prices, three types of
regression lines have been fitted, as follows:

$$\log X_1' = a + b \log X_2 \tag{A}$$
$$\frac{1}{X_1'} = a + bX_2 \tag{B}$$
$$X_1' = a + b_1X_2 + b_2X_2^2 \tag{C}$$

The standard error of estimate, being a standard deviation, is
defined as follows:

$$N\sigma_{1.2}^2 = \Sigma d^2 \tag{17}$$

where each $d$ is defined, taking regression type (C), for example,
as

$$d = X_1 - X_1' = X_1 - a - b_1X_2 - b_2X_2^2 \tag{18}$$

Hence, each $d^2$ will be as follows:

$$d^2 = d(X_1 - a - b_1X_2 - b_2X_2^2) = dX_1 - ad - b_1X_2d - b_2X_2^2d \tag{19}$$

If all these $d^2$'s are added, the following result is obtained:

$$\Sigma d^2 = \Sigma X_1d - a\Sigma d - b_1\Sigma X_2d - b_2\Sigma X_2^2d \tag{20}$$

By the least-squares condition, however, the last three terms
of Eq. (20) are equal to zero, for¹

$$\Sigma d = \Sigma (X_1 - a - b_1X_2 - b_2X_2^2) = 0$$
$$\Sigma X_2d = \Sigma X_2(X_1 - a - b_1X_2 - b_2X_2^2) = 0$$
$$\Sigma X_2^2d = \Sigma X_2^2(X_1 - a - b_1X_2 - b_2X_2^2) = 0$$

¹ See p. 384.

Therefore, Eq. (20) reduces to the following:

$$\Sigma d^2 = \Sigma X_1d = \Sigma X_1(X_1 - a - b_1X_2 - b_2X_2^2) = \Sigma X_1^2 - a\Sigma X_1 - b_1\Sigma X_1X_2 - b_2\Sigma X_1X_2^2 \tag{21}$$

Accordingly, the formula for the square of the standard error of
estimate is as follows:

$$\sigma_{1.2}^2 = \frac{\Sigma X_1^2 - a\Sigma X_1 - b_1\Sigma X_1X_2 - b_2\Sigma X_1X_2^2}{N} \tag{22}$$

If regression type (B) were taken, it can be shown similarly
that

$$\sigma_{1.2}^2 = \frac{\Sigma \frac{1}{X_1^2} - a\Sigma \frac{1}{X_1} - b\Sigma \frac{X_2}{X_1}}{N} \tag{23}$$

If the logarithmic regression equation is chosen, the standard
error of estimate is found by a similar procedure to be as follows:

$$\sigma_{1.2}^2 = \frac{\Sigma \log^2 X_1 - a\Sigma \log X_1 - b\Sigma \log X_1 \log X_2}{N} \tag{24}$$

The values necessary to calculate these standard errors of
estimate are available, respectively, in Tables 39, 38, and 37.
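
Equations (22) to (24) share one pattern: the sum of squares of the dependent variable, less each regression statistic times the sum it multiplies in the normal equations, all divided by $N$. A minimal sketch of that pattern follows, assuming the work-sheet totals and the regression statistics obtained earlier in the chapter; the helper name `variance_about_regression` is illustrative. Square roots taken from these unrounded variances may differ slightly in the last decimals from hand figures that round the variance first.

```python
import math


def variance_about_regression(sum_squares, statistics, sums, n):
    """Return (sum_squares - sum of statistic_k * sum_k) / n.

    `statistics` holds a, b (or a, b1, b2); `sums` holds the matching
    left-hand totals of the normal equations, in the same order.
    """
    explained = sum(c * s for c, s in zip(statistics, sums))
    return (sum_squares - explained) / n


N = 19
log_var = variance_about_regression(
    27.0945, [1.5833, -1.5081], [22.5405, 5.7468], N)          # Eq. (24)
rec_var = variance_about_regression(
    0.09749, [-0.03538, 0.05564], [1.29896, 2.54365], N)       # Eq. (23)
par_var = variance_about_regression(
    5451.3758, [68.2077, -43.327, 7.9944],
    [306.58, 542.7359, 994.4092], N)                           # Eq. (22)

for name, var in (("logarithmic", log_var), ("reciprocal", rec_var),
                  ("parabolic", par_var)):
    print(name, round(var, 6), round(math.sqrt(var), 4))
```

Note that the three results are in different units (logarithms, reciprocals, and original price units), so they cannot be compared with one another directly.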

Calculation of standard error of estimate: For the logarithmic
regression:¹

$$\sigma_{1.2}^2 = \frac{27.0945 - (1.5833)(22.5405) - (-1.5081)(5.7468)}{19} = \frac{27.0945 - 35.6884 + 8.6668}{19} = \frac{0.0729}{19} = 0.0038$$
$$\sigma_{1.2} = 0.06164$$

By using the ordinary formula for the standard deviation,

$$\sigma^2 = \frac{\Sigma X^2}{N} - \left(\frac{\Sigma X}{N}\right)^2$$

(when the arbitrary origin is taken as zero), the necessary figures
are found in totals of the appropriate columns of Table 37, and
it is found that

$$\sigma_1^2 = 0.0187$$
$$\sigma_1 = 0.1368$$

¹ The scatter formula for the logarithmic regression could be calculated
by using the formula employed in the linear case, as follows:

$$\sigma_{1.2}^2 = \sigma_{\log X_1}^2\,(1 - r_{\log X_1 \log X_2}^2)$$

Since, however, the logarithmic $r$ has not been calculated, it is simpler to
use the formula based on the least-squares equations.
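For the reciprocal regression:¹

$$\sigma_{1.2}^2 = \frac{0.09749 - (-0.03538)(1.29896) - (0.05564)(2.54365)}{19} = \frac{0.09749 + 0.04596 - 0.14153}{19} = \frac{0.00192}{19} = 0.000101$$
$$\sigma_{1.2} = 0.01$$

The standard deviation of $1/X_1$ is found by using the following
formula:

$$\sigma_{1/X_1}^2 = \frac{\Sigma \frac{1}{X_1^2}}{N} - \left(\frac{\Sigma \frac{1}{X_1}}{N}\right)^2$$

The necessary values are found in the sums of the appropriate
columns of Table 38.

$$\sigma_{1/X_1}^2 = 0.00869$$
$$\sigma_{1/X_1} = 0.0932$$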

For the parabolic regression:

$$\sigma_{1.2}^2 = \frac{5{,}451.3758 - 68.2077(306.58) - (-43.327)(542.7359) - (7.9944)(994.4092)}{19}$$
$$= \frac{5{,}451.3758 - 20{,}911.1167 + 23{,}515.1183 - 7{,}949.7049}{19} = \frac{105.6725}{19} = 5.5617$$
$$\sigma_{1.2} = 2.3583$$

Using the ordinary formula for calculating the standard deviation
when zero is taken as the arbitrary origin,

$$\sigma_1^2 = 26.5508 \qquad \sigma_1 = 5.1528$$

¹ The scatter formula for the reciprocal regression could be calculated by
using the formula for the linear case, as follows:

$$\sigma_{1.2}^2 = \sigma_{1/X_1}^2\,(1 - r_{(1/X_1)X_2}^2)$$

Since, however, the reciprocal $r$ has not been calculated, it is simpler to use
the formula based on the least-squares equations.

Table 42 is a summary of the estimates of cotton prices made
above, together with ranges of plus and minus one standard
error of estimate. In the cases of the logarithmic and reciprocal
regressions these ranges are converted into original units of
the data in order to show their significance.

Table 42. Estimates and Ranges of Twice the Standard Error of
Estimate for Cotton Prices Based on Three Regression Curves

Estimates and ranges, logarithmic regression

 Estimated       Range in logarithms            Estimated    Range of price,
 log of price    log X1 + σ1.2   log X1 - σ1.2  price X1     antilogarithms
   0.98317         1.04481         0.92153         9.62      11.19     8.35
   1.06690         1.12854         1.00526        11.67      13.45    11.29
   1.16292         1.22456         1.10128        14.55      16.77    12.63
   1.27547         1.33711         1.21383        18.86      21.73    13.23
   1.46612         1.52776         1.40448        29.25      33.71    25.38
   1.58330         1.64494         1.52166        38.31      44.15    33.24

Estimates and ranges, reciprocal regression

 Estimated       Range in reciprocals           Estimated    Range of estimated price,
 reciprocal      1/X1 + σ1.2     1/X1 - σ1.2    price X1     converted from reciprocals
 of price
   0.10372         0.11372         0.09372         9.64       8.79    10.67
   0.08703         0.09703         0.07703        11.49      10.31    12.98
   0.07034         0.08034         0.06034        14.22      12.45    16.57
   0.05364         0.06364         0.04364        18.64      15.71    22.91
   0.03695         0.04695         0.02695        27.06      21.30    37.11
   0.02026         0.03026         0.01026        49.36      33.05    97.96

Estimates and ranges, parabolic regression

 Estimated       Standard error of estimate
 price X1        X1 + σ1.2       X1 - σ1.2
    9.85           12.21            7.49
   11.58           13.94            9.22
   14.75           17.11           12.39
   19.35           21.71           16.99
   25.39           27.75           23.03
   32.87           35.23           30.51

The differences are notable. For the lower levels of price, the
reciprocal regression gives estimates with small standard errors of
estimate, but for the higher price levels the standard error of
estimate is smallest with the parabolic regression. Each of these
methods of calculating regression curves assumes that the variance
in $X_1$ is the same for all subgroups of $X_1$ associated with varying
values of $X_2$. The logarithmic regression assumes that, when
converted into logarithms, the variance about the logarithmic
regression is equal at all points but that, when converted into
antilogarithms, it will be larger for the higher prices. The
reciprocal regression assumes equal variance about the curve in
terms of reciprocals but, when converted, the variance about the
higher prices is larger than the variance about the lower prices.
The question suggests itself: Which one of these three
assumptions about the character of variance about the curves of regres-
sion best suits the data of the particular problem? This question 
is answered by determining which of the regression curves is the 
best fit for the data in question. 

Correlation Index. For each of the curves of regression 
calculated in the previous section, a corresponding index of 
correlation will help to determine which of the regression curves 
is the best fit for the data. The standard error of estimate 
measures the divergence of the bivariates from the curve of 
regression; the correlation index measures the goodness of fit 
of the curve of regression. The indexes of correlation may be 
calculated by using Eq. (2).

Calculation of Indexes of Correlation: For the logarithmic
regression:

$$I_{12}^2 = \frac{\sigma_1^2 - \sigma_{1.2}^2}{\sigma_1^2} = \frac{0.0187 - 0.0038}{0.0187} = \frac{0.0149}{0.0187} = 0.7968$$
$$I_{12} = 0.8926$$

For the reciprocal regression:

$$I_{12}^2 = \frac{0.00869 - 0.00010}{0.00869} = \frac{0.00859}{0.00869} = 0.9885$$
$$I_{12} = 0.9942$$

For the parabolic regression:

$$I_{12}^2 = \frac{26.5508 - 5.5617}{26.5508} = \frac{20.9891}{26.5508} = 0.7905$$
$$I_{12} = 0.8891$$

The high correlation index obtained for the reciprocal regres- 
sion appears to indicate that the cotton supply and price data 
for the period 1900 to 1940 are correlated in a reciprocal manner. 
It indicates that the sample data are fitted by the reciprocal curve 
of regression better than by either the logarithmic curve or the 
parabolic curve. 
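
The index of correlation is the same short computation in every case, so it can be written once as a function. The sketch below is illustrative; the name `correlation_index` is an assumption, and the inputs are the rounded variances quoted in the text for two of the curves, so the outputs simply retrace the printed arithmetic.

```python
import math


def correlation_index(total_variance, residual_variance):
    """Square root of the share of total variance not left about the curve."""
    return math.sqrt((total_variance - residual_variance) / total_variance)


log_index = correlation_index(0.0187, 0.0038)       # logarithmic regression
par_index = correlation_index(26.5508, 5.5617)      # parabolic regression
print(round(log_index, 4), round(par_index, 4))
```

Both variances passed to the function must be in the same units: logarithms for the logarithmic curve, original price units for the parabola.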

It is to be noted that, in general, the use of the index of correla- 
tion to show which curve is the best fit is valid only when all 
curves have the same number of regression statistics. Here two 
curves had two regression statistics and one had three. A curve 
with a larger number of regression statistics will always give a 
better fit than a similar curve with a smaller number of regression 
statistics. Here, however, the parabola that had three regression
statistics gives a worse fit than either the logarithmic
or the reciprocal curve, each of which has only two regression
statistics.

The Index of Correlation and Analysis of Variance. As already
pointed out, in the cases of the logarithmic and reciprocal curves
of regression, the Pearsonian coefficient of correlation may be
calculated. When transformed into original units, this coefficient
of correlation becomes the index of correlation. In the problems
above illustrated, however, the correlation index was calculated
instead by using the general formula based upon the scatter
because the arithmetic involved in the latter method is simpler.
In logarithmic and reciprocal units, respectively, the coefficient
of correlation squared is, for these curves of regression, a
coefficient of proportional variance just as is the $r^2$ for simple linear
correlation problems.

For the parabolic curve of regression, the deviations from the
curve of regression may be described as

$$X_1 - X_1' = d \quad\text{and}\quad X_1' = X_1 - d$$

If these are added for the entire data, the result is

$$\Sigma X_1' = \Sigma X_1 - \Sigma d$$

and since $\Sigma d = 0$ it follows that

$$\Sigma X_1' = \Sigma X_1 = N\bar{X}_1$$

and hence the mean of $X_1'$ equals the mean of $X_1$.

Consequently, the sum of squares of $X_1'$ may be obtained as
follows:

$$N\sigma_{1'}^2 = \Sigma X_1'^2 - N\bar{X}_1^2 \tag{25}$$

In Eq. (25), $\Sigma X_1'^2$ may be evaluated as follows:

$$\Sigma X_1'^2 = \Sigma (X_1 - d)^2 = \Sigma X_1^2 - 2\Sigma X_1d + \Sigma d^2$$

As shown above in Eq. (21), $\Sigma X_1d = \Sigma d^2$. Therefore,

$$\Sigma X_1'^2 = \Sigma X_1^2 - \Sigma d^2$$

and

$$N\sigma_{1'}^2 = \Sigma X_1^2 - \Sigma d^2 - N\bar{X}_1^2 \tag{26}$$

However, it is true by definition that

$$\Sigma X_1^2 - N\bar{X}_1^2 = N\sigma_1^2 \quad\text{and}\quad \Sigma d^2 = N\sigma_{1.2}^2$$

Therefore, Eq. (26) reduces to the following:

$$N\sigma_{1'}^2 = N\sigma_1^2 - N\sigma_{1.2}^2 \tag{27}$$

From Eq. (27), by dividing by $N$ and then by $\sigma_1^2$ and transposing
terms, it follows that

$$\frac{\sigma_{1'}^2}{\sigma_1^2} = 1 - \frac{\sigma_{1.2}^2}{\sigma_1^2} \tag{28}$$

and from Eq. (28) it follows by definition of $I_{12}^2$ that

$$I_{12}^2 = \frac{\sigma_{1'}^2}{\sigma_1^2} \tag{29}$$

Hence the square of the correlation index has the same significance
as the square of the linear coefficient of correlation; it measures
the proportion of the total variance accounted for by the assumed
type of curvilinear correlation.
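
The decomposition in Eqs. (27) to (29) can be checked numerically on any parabola fitted by least squares. The sketch below uses a small set of hypothetical figures, not the cotton data, and a generic elimination routine named `solve` that is an assumption of this illustration. It fits the parabola from its normal equations and confirms that the variance of the fitted values plus the variance of the residuals equals the total variance.

```python
def solve(matrix, rhs):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for j in range(n):
            if j != i:
                f = m[j][i] / m[i][i]
                m[j] = [vj - f * vi for vj, vi in zip(m[j], m[i])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Hypothetical supply (X2) and price (X1) figures, for illustration only.
supply = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
price = [30.1, 22.4, 17.0, 13.8, 11.9, 12.3, 14.0]
n = len(supply)

# Power sums for the three normal equations of the parabola.
s1 = sum(supply)
s2 = sum(x ** 2 for x in supply)
s3 = sum(x ** 3 for x in supply)
s4 = sum(x ** 4 for x in supply)
a, b1, b2 = solve(
    [[n, s1, s2], [s1, s2, s3], [s2, s3, s4]],
    [sum(price),
     sum(y * x for x, y in zip(supply, price)),
     sum(y * x ** 2 for x, y in zip(supply, price))],
)

fitted = [a + b1 * x + b2 * x ** 2 for x in supply]
residuals = [y - f for y, f in zip(price, fitted)]
mean_price = sum(price) / n

var_total = sum((y - mean_price) ** 2 for y in price) / n
var_fitted = sum((f - mean_price) ** 2 for f in fitted) / n
var_residual = sum(d ** 2 for d in residuals) / n
index_squared = 1 - var_residual / var_total

print(round(var_total, 4), round(var_fitted + var_residual, 4))
print(round(index_squared, 4), round(var_fitted / var_total, 4))
```

The two figures on each printed line should agree, which is the content of Eqs. (27) and (29) respectively.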



CHAPTER XVI 


MULTIPLE AND PARTIAL CORRELATION 

To deal with the relationship between only two variables
the method of correlation so far discussed is useful, but in the
nonexperimental sciences it is frequently and indeed usually
more important to be able to deal with the association between
three or more variables. In the social sciences in particular,
variations in practically every factor are related to variations
in several other factors rather than in a single other factor. For
example, variations in the price of cotton are related not only to
changes in the production and consumption of cotton but also
to changes in the prices of substitutes for cotton such as rayon
and, in addition, to changes in the value of money. Again, the
consumption of a commodity such as gasoline may depend more
upon the number of automobiles in existence and upon the
number of miles of hard-surfaced roads available for use than
upon the price of gasoline. As a matter of fact, it is dependent
on all these factors and others too. In such cases it is essential
to have some method of "multiple correlation" and "partial
correlation."

Definitions of Terms. Multiple Correlation. Multiple correlation
is an extension to more than two variables of the methods
of simple correlation. Simple linear correlation provides a line
of regression from which an average value for the dependent
variable may be estimated if the value of the independent
variable is given. Multiple linear correlation provides a
"plane" of regression by means of which an average value for
the dependent variable may be estimated if the values of
two or more independent variables are given. The plane of
regression of the price of cotton on the price of rayon and on the
wholesale price level, for example, would permit the estimation
of the former from joint knowledge of the latter, instead of from
the price of rayon alone. Similarly, the plane of regression
of the second-semester English grade on the first-semester English
grade and on the verbal scholastic-aptitude test grade would
permit the estimation of the former from joint knowledge of the
latter, instead of from only the first-semester English grade.
The regression equation, accordingly, has two or more terms
to the right instead of one; its general form is as follows:

$$X_1' = a_{1.23\ldots} + b_{12.3\ldots}X_2 + b_{13.2\ldots}X_3 + \cdots$$

where $X_1$ is the dependent variable, $X_2$, $X_3$, etc., are the
independent variables, and $a$ and the $b$'s are estimated parameters, or
regression statistics, whose numerical values are determined in
any particular case by the method of least squares. The
numerical subscripts will be explained later. For the moment it only
need be noted that a plane of regression is but the extension to
more than two variables of the idea of a line of regression.
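
A plane of regression with two independent variables can be fitted by least squares in a few lines. The sketch below is an illustration under stated assumptions: the function name `fit_plane` and the figures are hypothetical, and the normal equations are solved in deviation form (each variable measured from its own mean), which reduces them to two equations in the two slopes.

```python
def fit_plane(x1, x2, x3):
    """Least-squares plane x1' = a + b2*x2 + b3*x3; returns (a, b2, b3)."""
    n = len(x1)
    m1, m2, m3 = sum(x1) / n, sum(x2) / n, sum(x3) / n

    def cross(u, mu, v, mv):
        # Sum of products of deviations from the means.
        return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))

    s22 = cross(x2, m2, x2, m2)
    s33 = cross(x3, m3, x3, m3)
    s23 = cross(x2, m2, x3, m3)
    s12 = cross(x1, m1, x2, m2)
    s13 = cross(x1, m1, x3, m3)

    # Two normal equations in deviation form, solved by determinants:
    #   s12 = b2*s22 + b3*s23
    #   s13 = b2*s23 + b3*s33
    det = s22 * s33 - s23 ** 2
    b2 = (s12 * s33 - s13 * s23) / det
    b3 = (s13 * s22 - s12 * s23) / det
    a = m1 - b2 * m2 - b3 * m3
    return a, b2, b3


# Hypothetical figures: a dependent series and two independent series.
dependent = [12.1, 13.9, 15.2, 18.0, 19.1, 22.3]
first_independent = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
second_independent = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

a, b2, b3 = fit_plane(dependent, first_independent, second_independent)
print(round(a, 3), round(b2, 3), round(b3, 3))
```

An estimate of the dependent variable for given values of the two independent variables is then `a + b2 * x2 + b3 * x3`.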

In simple linear correlation, dispersion about the line of
regression of $X_1$ on $X_2$ serves as a measure of the accuracy of any
estimate of $X_1$ made from the line of regression. In multiple
correlation, dispersion about the plane of regression serves as a
measure of the accuracy of any estimate of the dependent variable
made by reference to the plane of regression. One of the essential
problems of multiple correlation is to calculate dispersion about
the plane of regression.

In simple correlation, a line of regression is merely a law of 
relationship between one variable taken as a dependent variable 
and another taken as an independent variable; it does not of 
itself describe the degree of relationship or association that exists. 
To measure the degree of linear association is the function of the 
coefficient of correlation. Since the coefficient of correlation 
measures the amount of linear association, it also serves as a 
measure of the goodness of fit of the linear-regression equation 
to the bivariate distribution and yields a measure of the general 
degree of accuracy of estimates made by reference to the regres- 
sion equation. In multiple correlation, the coefficient of multiple 
correlation serves the same general function. First, it serves 
as a measure of the degree of association between one variable 
taken as the dependent variable and a group of other variables 
taken as the independent variables. Hence, it also serves as a
measure of the goodness of fit of the calculated plane of regression
and consequently as a measure of the general degree of accuracy
of estimates made by reference to the equation for the plane of
regression.

In simple linear correlation, relationships are completely
described by two lines of regression, one in which $X_1$ is taken
as the dependent variable and the other in which $X_2$ is the
dependent variable. In multiple correlation involving three
variables, there are three planes of regression. If four variables 
are involved, there are four planes of regression, and so forth. 
In general, there are as many planes of regression as there are 
variables that may be taken as dependent variables, in short, as 
many planes of regression as variables. In particular cases, the 
intuitive sense of cause and effect may lead to the rejection of 
some of these possible planes of regression as being without any 
practical significance. They must always, however, be consid- 
ered as theoretical possibilities. 

Where only two variables are considered, the coefficient of
correlation between $X_2$, taken as dependent, and $X_1$, taken as
independent, is the same as the coefficient of correlation between
$X_1$, taken as dependent, and $X_2$, taken as independent. The
measure of goodness of fit of the line of regression of $X_2$ on $X_1$
is the same as the measure of the goodness of fit of the line of
regression of $X_1$ on $X_2$. This cannot be said of the various
multiple-correlation coefficients. The multiple-correlation
coefficient that measures the degree of association between $X_1$,
dependent, and $X_2$ and $X_3$, independent, as a group and that also
serves as a measure of the goodness of fit of the plane of regression
of $X_1$ on $X_2$ and $X_3$ is not the same as the coefficient of multiple
correlation that measures the degree of association of $X_2$,
dependent, with $X_1$ and $X_3$, independent, taken as a group and that
also measures the goodness of fit of the plane of regression of $X_2$
on $X_1$ and $X_3$. Furthermore, neither of these two coefficients is
equal, except by mere chance, to the coefficient of multiple
correlation that measures the degree of association of $X_3$,
dependent, with $X_1$ and $X_2$, independent, taken together and that also
measures the goodness of fit of the plane of regression of $X_3$ on $X_1$
and $X_2$. In multiple correlation, there are as many different
coefficients of multiple correlation as there are planes of regression.

Linear vs. Nonlinear Relationships. The simplest form of 
correlation analysis rests on the assumption that the association 
between the variables is of a linear type. In some cases, this 
assumption does violence to the facts, the association being 
clearly of a nonlinear form. Where a simple form of nonlinear 
relationship exists between two variables, it has been found 
possible to fit a curve of regression instead of a line of regression 
and to calculate a correlation coefficient that measures the 
goodness of fit of this curve. Whether such a simple curve can be 
fitted or not, it is possible to calculate a measure of nonlinear 
relationship, called the "correlation ratio," that depends on a 
comparison of the variation about the means of the columns 
(or rows) of the grouped data with the total variation in the 
data.¹ 

Such devices as these can also be used when nonlinear rela- 
tionships exist among three or more variables. When the 
nonlinear relationship takes a simple form, it is possible to fit 
a curved plane or a surface of regression. A multiple-correlation 
index I1.23 can also be calculated to serve as a measure of the 
goodness of fit of this surface of regression. Whether a simple 
form of a curved surface can be fitted or not, it is always possible 
to calculate a multiple-correlation ratio of the same sort as the 
correlation ratio for only two variables. Similar nonlinear 
relationships can also be carried over into the analysis of partial 
correlation. 

Partial Correlation. Partial correlation is concerned with a 
concept resulting from the fact that more than two variables 
are correlated; if only two variables are considered, there is no 
place for partial correlation. Where there are three or more 
variables, however, the question of the interrelationships between 
the variables becomes a part of the analysis. How much of the 
apparent association between two variables (X1 and X2) is due 
to their common association with a third variable (X3) and how 
much to their direct connection or to some connection through 
other variables independent of X3? Would X1 and X2 continue 
to vary together if X3 were held constant? This is the new 
problem that partial correlation attempts to solve. Fortunately, 
the methods employed in its solution are the same fundamentally 
as those involved in simple linear correlation. 

This chapter is primarily concerned with linear multiple and 
partial correlation involving three variables. The notation 
involved in multiple and partial correlation will first be 
summarized. A brief discussion of a multivariate frequency 
distribution, upon which any form of multiple or partial 
analysis must be based, will follow. Ensuing sections of the 
chapter will explain the fitting of planes of regression and will 
derive formulas for finding the numerical values of the regression 
statistics of any given plane fitted by the method of least squares. 
Formulas for measuring dispersion and for calculating 
multiple-correlation coefficients will also be derived. Partial correlation 
will be explained in more detail, and methods of calculating 
partial-correlation coefficients will be indicated. In the next 
chapter the entire subject will be illustrated by an example. 

¹ See Chap. XV. 

Notation. It is the practice in multiple- and partial-correlation 
analysis to let a symbol indicate the class to which a given 
quantity belongs and to denote by subscripts the particular 
member of the designated class. For example, if X stands for 
any variable measured in original units, X1 indicates a particular 
member of this group and its subscript distinguishes it from 
X2, X3, etc., which are other members of the group. In a designated 
problem, X1 may be the price of cotton, X2 the price of rayon, 
and X3 the general price level. Following is a summary of the 
various symbols used in the subsequent analysis, in which special 
attention should be directed to the subscripts: 

X1, X2, X3              Variables measured in original units 
X1', X2', X3'           The estimated values of these variables given by the 
                        three regression equations in which the variables 
                        are taken as dependent. The primes distinguish 
                        them from the actual values of X1, X2, and X3 
x1, x2, x3              Variables measured from their means as origins 
                        (x1 = X1 - X̄1, etc.) 
x1', x2', x3'           The estimated values of x1, x2, and x3 given by their 
                        regression equations and measured from the 
                        means of X1, X2, X3 (x1' = X1' - X̄1, etc.) 
X̄1, X̄2, X̄3              Means of X1, X2, X3 
σ1, σ2, σ3              Standard deviations of X1, X2, X3 
x1/σ1, x2/σ2, x3/σ3     Variables measured from their means as origins 
                        and expressed in terms of standard-deviation 
                        units 
x1'/σ1, x2'/σ2, x3'/σ3  x1', x2', x3' expressed in terms of the 
                        standard-deviation units of X1, X2, X3 
$$
\begin{aligned}
X_1' &= a_{1.23} + b_{12.3}X_2 + b_{13.2}X_3 \\
X_2' &= a_{2.13} + b_{21.3}X_1 + b_{23.1}X_3 \\
X_3' &= a_{3.12} + b_{31.2}X_1 + b_{32.1}X_2
\end{aligned}
\tag{1}
$$

These are the equations for the three planes of 
regression in which the variables are measured 
in terms of original units. The a's and b's are 
the regression statistics of the equations, of 
which the explanation follows: 

a1.23             The constant term in the regression equation in 
                  which X1 is taken as the dependent variable 
                  and X2 and X3 as the independent variables 

a2.13 and a3.12   These are the constant terms when X2 and 
                  X3, respectively, are the dependent variables. 
                  The subscript before the point refers to the 
                  dependent-variable number; the subscripts 
                  after the point refer to the independent 
                  variables. The order of subscripts after the point 
                  is immaterial; that is, a2.13 = a2.31 

b12.3             The coefficient of X2 in the regression equation in 
                  which X1 is taken as the dependent variable 
                  and X3 is the other independent variable. The 
                  first number in the subscript indicates the 
                  dependent variable; the second number in 
                  the subscript indicates the variable of which the 
                  b is a coefficient; the point followed by the 
                  other subscript indicates that a third variable 
                  is considered. Similarly, b13.2 is the coefficient 
                  of X3 in the same regression equation. It is to 
                  be noted that b12.3 ≠ b21.3 

b21.3             The coefficient of X1 in the regression equation in 
                  which X2 is taken as the dependent variable 
                  and X3 is the other independent variable; 
                  b23.1 is the coefficient of X3 in the same 
                  regression equation 

b32.1 and b31.2   These have a similar meaning for the third 
                  regression equation 

$$
\begin{aligned}
x_1' &= b_{12.3}x_2 + b_{13.2}x_3 \\
x_2' &= b_{21.3}x_1 + b_{23.1}x_3 \\
x_3' &= b_{31.2}x_1 + b_{32.1}x_2
\end{aligned}
\tag{2}
$$

Equations (2) are another form of the three 
regression equations. Here the variables are 
expressed in terms of deviations from their 
respective means. In these equations there 
are no a's, or constant terms, because the planes 
of regression all pass through the point given by 
the means of the three variables. The b's are 
the same as those in Eqs. (1) 


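$$
\begin{aligned}
\frac{x_1'}{\sigma_1} &= \beta_{12.3}\frac{x_2}{\sigma_2} + \beta_{13.2}\frac{x_3}{\sigma_3} \\
\frac{x_2'}{\sigma_2} &= \beta_{21.3}\frac{x_1}{\sigma_1} + \beta_{23.1}\frac{x_3}{\sigma_3} \\
\frac{x_3'}{\sigma_3} &= \beta_{31.2}\frac{x_1}{\sigma_1} + \beta_{32.1}\frac{x_2}{\sigma_2}
\end{aligned}
\tag{3}
$$

Equations (3) give a third form in which the three 
regression equations may be written. Here 
the variables represent deviations from their 
respective means expressed in standard-deviation 
units [the x''s are expressed in terms of the 
standard deviations of the x's (σ1, σ2, σ3) 
instead of the standard deviations of the x''s 
themselves]. The form is similar to 

$$\frac{x_1'}{\sigma_1} = r\,\frac{x_2}{\sigma_2}$$

for two variables.¹ 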

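In Eqs. (3), the β's correspond to the b's in 
Eqs. (1) and (2). As may be seen by comparing 
Eqs. (2) and (3), the β's are related to the b's 
in the following way: 

$$
\begin{aligned}
b_{12.3} &= \beta_{12.3}\frac{\sigma_1}{\sigma_2} & b_{13.2} &= \beta_{13.2}\frac{\sigma_1}{\sigma_3} \\
b_{21.3} &= \beta_{21.3}\frac{\sigma_2}{\sigma_1} & b_{23.1} &= \beta_{23.1}\frac{\sigma_2}{\sigma_3} \\
b_{31.2} &= \beta_{31.2}\frac{\sigma_3}{\sigma_1} & b_{32.1} &= \beta_{32.1}\frac{\sigma_3}{\sigma_2}
\end{aligned}
\tag{4}
$$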

‘ See pp. 34»-861. 
If the symmetry of these equations is noted, 
they are easily remembered; for example, the 
subscripts of the b's are the same as the 
subscripts of the β's, and the order of the first two 
subscript numbers describes the subscript for 
sigma in numerator and denominator, 
respectively. It is to be noted that β12.3 does not 
equal β21.3, etc. 

σ1.23   The scatter about the plane of regression of X1 on X2 
        and X3 

σ2.13   The scatter about the plane of regression of X2 on X1 
        and X3 

σ3.12   The scatter about the plane of regression of X3 on X1 
        and X2 

R1.23   The multiple-correlation coefficient between X1 on the 
        one hand and X2 and X3 on the other 
R2.13   The multiple-correlation coefficient between X2 on the 
        one hand and X1 and X3 on the other 
R3.12   The multiple-correlation coefficient between X3 on the 
        one hand and X1 and X2 on the other 
r12.3   The partial-correlation coefficient between X1 and X2 
        when X3 is held constant. The position of the 
        subscripts is more important than the noncapitalization of 
        the r in distinguishing it from the multiple-correlation 
        coefficients. The subscript after the point indicates 
        which variable is held constant. r21.3 = r12.3 
r13.2   The partial-correlation coefficient between X1 and X3 
        when X2 is held constant 

r23.1   The partial-correlation coefficient between X2 and X3 
        when X1 is held constant 

Study of the symmetry in the above system of notation will 
make it easy to remember. With the exception of the notation 
for partial-correlation coefficients, the order of subscripts before 
the point is always significant; following the point it is always 
immaterial. 

MULTIVARIATE FREQUENCY DISTRIBUTION 

The monovariate frequency distribution, it will be recalled, 
is the basis for the determination of various measures describing 
the central tendency and variation about the central tendency 
of a single variable. The bivariate frequency distribution 
(Chap. XIII) is the basis for the calculation of the lines of 
regression and the simple correlation r as well as for the 
calculation of correlation ratios. In fact, the bivariate frequency 
distribution contained all the information regarding the joint 
variation or covariance of X1 and X2 and hence formed the basis 
for the calculation of any measure or law of relationship between 
these two variables, linear or otherwise. Similarly, the 
multivariate frequency distribution contains all the information about 
the covariance of X1, X2, X3, etc., and it thus forms the basis for 
the calculation of any measure or law of relationship between the 
different variables, individually or in groups. 

Fig. 117. A trivariate frequency distribution. 

Figure 117 shows a trivariate frequency distribution in which 
each variable is grouped into three class intervals. A small 
number of class intervals is taken in order to simplify the dia- 
gram; in any actual problem, the number of class intervals would, 
of course, be larger. 

The figure shows the frequency (the number written on the 
floor of each cubical cell) with which X1 falls within a given class 
interval at the same time that X2 falls within another given class 
interval and X3 falls within a third given class interval. 
Accordingly, in Fig. 117, the frequency with which X1 takes on values 
between 100 and 200 at the same time that X2 takes on values 
between 1 and 2 and X3 takes on values between 5 and 10 is 10. 
This is the frequency of the joint occurrence of the specified 
X1, X2, and X3 values. The frequencies in other cells 
represent the frequency of joint occurrence of other X1, X2, and X3 
combinations. 

If the frequencies of this trivariate frequency distribution are 
projected upon any one of the three reference planes, that is, if 
the frequencies are added from top to bottom, from left to right, 
or from front to rear, a bivariate frequency distribution is 
obtained for two of the three variables. For example, if the 
frequencies are projected upon the X2X3 plane, the bivariate 
frequency distribution for these two variables shown in Table 43 
is obtained. In Table 43 the frequencies of the trivariate 
frequency distribution in Fig. 117 are added from top to bottom. 

Table 43 
[Bivariate frequency table of X2 (class limits 0, 1, 2, 3) and X3 
(class limits 5, 10, 15); cell frequencies not recoverable.] 

If the frequencies are projected upon the X1X2 plane, the 
bivariate frequency distribution of X1 and X2 shown in Table 44 
is found. To obtain the frequencies in Table 44, the frequencies 
in the trivariate frequency distribution (Fig. 117) are added 
from front to rear. 

Table 44 
[Bivariate frequency table of X1 (class limits 100, 200, 300) and X2 
(class limits 0, 1, 2, 3); cell frequencies not recoverable.] 
Finally, if the frequencies are projected onto the X1X3 plane, 
the bivariate frequency distribution of X1 and X3 shown in 
Table 45 is obtained. The frequencies shown in Table 45 are 
the sum from left to right of the frequencies in the trivariate 
frequency distribution shown in Fig. 117. 

Table 45 
[Bivariate frequency table of X1 (class limits 0, 100, 200, 300) and X3 
(class limits 0, 5, 10, 15); cell frequencies not recoverable.] 
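
The three projections described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the array `f` and every frequency in it are hypothetical (the figures of Fig. 117 are not reproduced here), except that the central cell is set to 10 to echo the value quoted in the text. The index order (X1 class, X2 class, X3 class) is also an assumption of the sketch.

```python
# Hypothetical cell frequencies f[i][j][k] for a 3 x 3 x 3 trivariate table:
# i indexes the X1 class, j the X2 class, k the X3 class.
# Only the central cell (10) echoes the text; all other values are invented.
f = [
    [[4, 2, 1], [2, 3, 1], [1, 1, 2]],
    [[2, 3, 1], [3, 10, 3], [1, 3, 2]],
    [[1, 1, 2], [1, 3, 3], [1, 2, 4]],
]
n = len(f)

# Add over X1 (top to bottom): bivariate table of X2 and X3.
f23 = [[sum(f[i][j][k] for i in range(n)) for k in range(n)] for j in range(n)]
# Add over X3 (front to rear): bivariate table of X1 and X2.
f12 = [[sum(f[i][j][k] for k in range(n)) for j in range(n)] for i in range(n)]
# Add over X2 (left to right): bivariate table of X1 and X3.
f13 = [[sum(f[i][j][k] for j in range(n)) for k in range(n)] for i in range(n)]


def grand_total(table):
    """Total frequency of a two-way table."""
    return sum(sum(row) for row in table)


total = sum(grand_total(plane) for plane in f)
print(total, grand_total(f23), grand_total(f12), grand_total(f13))
```

Each projection keeps the total number of cases unchanged; only the information about the variable summed over is lost.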

In the three bivariate frequency distributions shown in Tables 
43 to 45, it is to be noted that X2 and X3 are positively 
correlated, as are also X1 and X2 and X1 and X3. The given 
trivariate frequency distribution (Fig. 117) is one in which all the 
variables are positively correlated with each other. In this 
case, as the values of X2 and X3 both increase, the mean value 
of X1 also tends to increase; in other words, the plane of 
regression of X1 on X2 and X3 would slope upward from the origin in 
both the X2 and the X3 direction. Because of the all-round 
positive correlation between the variables, the other planes of 
regression would also slope upward from the origin in both directions. 

The net regression between two variables in a multivariate 
distribution is measured by the b statistic, and it is possible to 
have a negative net regression b12.3 although the Pearsonian 
coefficient of correlation r12 is positive, and vice versa. If r12 
is small compared with r13 and r23, the latter being either both 
negative or both positive, the plane of regression of X1 on X2 
and X3 may slope downward in the X2 direction even if r12 is 
positive. The statistic b12.3 is of the same sign as r12 so long as 
r12 - r13r23 is of the same sign as r12. If this condition is not 
fulfilled, that is, if r12 - r13r23 and r12 are of opposite sign, b12.3 
will be opposite in sign to r12 and the plane will slope in the 
opposite direction from that indicated by the sign of r12, which, 
when multiplied by the ratio σ1/σ2, describes the slope of the 
line of regression in the bivariate distribution of X1 and X2.* In 

* See pp. 34^351. 
the case where r12 is positive but b12.3 is negative, the coefficient 
of partial correlation r12.3 is negative, agreeing with the sign of 
the b statistic. For this reason, the partial-correlation coefficient 
may be said to measure the net correlation between the two 
variables. 
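
The sign reversal described here can be checked numerically. The sketch below is illustrative only: the three simple correlations are hypothetical, the function names are invented for the example, the β formula is the one derived later in this chapter as Eqs. (9), and the partial-correlation formula is the standard first-order one. Since b12.3 equals β12.3 multiplied by a positive ratio of standard deviations, the sign of β12.3 is also the sign of b12.3.

```python
import math


def beta_12_3(r12, r13, r23):
    """Standardized net regression coefficient of X1 on X2, as in Eqs. (9)."""
    return (r12 - r13 * r23) / (1 - r23 ** 2)


def partial_r_12_3(r12, r13, r23):
    """Standard first-order partial correlation of X1 and X2, X3 held constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


# Hypothetical simple correlations: r12 is positive but small
# compared with r13 and r23, which are both positive.
r12, r13, r23 = 0.2, 0.7, 0.6

beta = beta_12_3(r12, r13, r23)
partial = partial_r_12_3(r12, r13, r23)

# r12 - r13*r23 is negative here, so both quantities come out negative
# even though the simple correlation r12 is positive.
print(beta, partial)
```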

If the net correlations between X1 and X2 and between X1 
and X3 are both negative, the plane of regression of X1 on X2 
and X3 slopes downward in both directions. In this instance, 
the b12.3 and the b13.2 of the regression equation are both negative. 
In other words, the mean value of X1 would tend to decrease 
with increases in the values of both X2 and X3. This particular 
plane of regression would have an all-round negative slope. If 
the net correlation between X1 and X3 is negative, however, and 
that between X1 and X2 is positive, the plane of regression of 
X1 on X2 and X3 slopes upward in the X2 direction, that is, the 
mean value of X1 increases as X2 increases; and the plane slopes 
downward in the X3 direction, that is, the mean value of X1 
declines as X3 increases. In this instance, b12.3 is positive, 
and b13.2 is negative. The plane of regression shows a positive 
relationship in one direction and a negative relationship in the 
other direction. 

These are a few of the possible forms that a trivariate 
frequency distribution might take. Others include nonlinear 
relationships. For example, the mean value of X1 might first 
increase as X2 increases and also as X3 increases and then later 
decrease as both these variables continued to increase, or X1 
might decline in the X2 direction after a certain point but 
continue to rise in the X3 direction. For either of these 
combinations, a curved plane or surface of regression would give a better 
fit than a straight plane. 

In order that there be all-round independence, that is to say, 
absolutely no correlation whatsoever, either linear or nonlinear, 
between any of the variables, the following conditions must exist: 

1. The distribution of X1 for any given X2 and X3 class 
intervals, that is, the distribution of X1 values for any given vertical 
shaft, must be of the same form (it must have the same mean, 
the same standard deviation, etc., even though it does not have 
the same number of cases) as the distribution of X1 values for 
every other vertical shaft of the trivariate frequency distribution 
(see Fig. 117). 

2. The distribution of X2 values for any given X1 and X3 
class interval, that is, the distribution of X2 values in any given 
horizontal shaft parallel to the X2-axis and perpendicular to the 
X1X3 plane, must be of the same form as the distribution of X2 
values in every other horizontal shaft parallel to the X2-axis 
(see Fig. 117). 

3. The distribution of X3 values for any given X1 and X2 
class interval, that is, the distribution of X3 values in any given 
horizontal shaft parallel to the X3-axis and perpendicular to 
the X1X2 plane, must be of the same form as the distribution of 
X3 values in every other shaft parallel to the X3-axis and 
perpendicular to the X1X2 plane (see Fig. 117). 
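
Condition 1 can be illustrated with a small constructed example. In the sketch below every number is hypothetical: the class midpoints `mid1` and the marginal weights `w1`, `w2`, `w3` are invented, and the cell frequencies are built as products of the marginal weights, which is one simple way of producing all-round independence. Every vertical shaft then has the same mean of X1 although the shafts contain different numbers of cases.

```python
# Hypothetical class midpoints of X1 and hypothetical marginal weights.
mid1 = [50.0, 150.0, 250.0]
w1 = [1, 2, 1]   # weights of the X1 classes
w2 = [1, 3, 2]   # weights of the X2 classes
w3 = [2, 2, 1]   # weights of the X3 classes

# Independent trivariate table: each cell is the product of its marginal weights.
f = [[[a * b * c for c in w3] for b in w2] for a in w1]


def shaft(j, k):
    """Frequencies of the X1 classes in the vertical shaft over cell (j, k)."""
    return [f[i][j][k] for i in range(len(w1))]


def shaft_mean(j, k):
    """Mean of X1 within one vertical shaft, using class midpoints."""
    col = shaft(j, k)
    return sum(m * n for m, n in zip(mid1, col)) / sum(col)


means = [shaft_mean(j, k) for j in range(len(w2)) for k in range(len(w3))]
counts = [sum(shaft(j, k)) for j in range(len(w2)) for k in range(len(w3))]
print(means)   # the same mean in every shaft
print(counts)  # but differing numbers of cases
```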

A close study of a multivariate frequency distribution is 
therefore always desirable before attempting to calculate any 
measure of relationship. Since in some instances the net corre- 
lation may be of opposite sign from simple linear correlation, as 
illustrated above, the examination of separate bivariate dis- 
tributions for each pair of variables is not always a reliable 
method. It is better to undertake a study of the multivariate 
distribution. In a trivariate problem a diagram similar to 
Fig. 117 could be set up, but for a large number of class inter- 
vals it would be extremely difficult, if not impossible, to draw. 
The multivariate distribution can be studied, however, by 
selecting all the X1 and X2 variates associated with a given range of 
X3 variates, for example, in Fig. 117, all the X1 and X2 variates 
associated with values of X3 from 5 to 10. In this manner, a 
series of frequency distributions of X1 for varying values of X2 
is obtained. 

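Table 46. Values of X1 and X2 Associated with Values of 
X3 between 5 and 10 
[Bivariate frequency table of X1 (class limits 0, 100, 200) and X2 
(class limits 0, 1, 2, 3); cell frequencies not recoverable.] 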

Similar tables could be constructed showing the values of X1 
and X2 associated with values of X3 between 0 and 5 and with 
values of X3 between 10 and 15. In this manner, the net 
correlation between X1 and X2 can be studied; and, by a similar 
procedure, the net correlations between X1 and X3 and between 
X2 and X3 can be examined. If such a study should reveal that 
linear relationships prevail, the methods to be discussed in the 
ensuing sections could be applied. If simple curvilinear 
relationships are apparent, some curved plane might better be fitted 
instead of a straight plane. In some instances, the latter could 
be accomplished by using logarithms, reciprocals, or some other 
transformation of the variables, to which linear functions could 
be fitted; in other instances, it might be necessary to fit parabolic 
functions to the original units. 
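
A rough sketch of this slicing procedure follows. It assumes a hypothetical 3 x 3 x 3 frequency array `f` indexed as (X1 class, X2 class, X3 class) and hypothetical class midpoints `mid1` for X1; none of these values comes from Fig. 117 or Table 46. The function `slice_at_x3` picks out the X1-by-X2 table for one X3 class, and `column_means` gives the mean of X1 within each X2 class of that table, so that the direction of the net relationship can be inspected.

```python
# Hypothetical trivariate frequencies f[i][j][k]:
# i = X1 class, j = X2 class, k = X3 class.
f = [
    [[4, 2, 1], [2, 3, 1], [1, 1, 2]],
    [[2, 3, 1], [3, 10, 3], [1, 3, 2]],
    [[1, 1, 2], [1, 3, 3], [1, 2, 4]],
]
mid1 = [50.0, 150.0, 250.0]  # hypothetical midpoints of the X1 classes


def slice_at_x3(freq, k):
    """Two-way table of X1 (rows) and X2 (columns) for one X3 class."""
    return [[freq[i][j][k] for j in range(len(freq[0]))] for i in range(len(freq))]


def column_means(table, mids):
    """Mean of X1 within each X2 class of a two-way table."""
    result = []
    for j in range(len(table[0])):
        col = [table[i][j] for i in range(len(table))]
        result.append(sum(m * n for m, n in zip(mids, col)) / sum(col))
    return result


middle = slice_at_x3(f, 1)          # X3 held within its middle class
means = column_means(middle, mid1)  # mean of X1 for each X2 class
print(means)
```

With these invented frequencies the mean of X1 rises from one X2 class to the next while X3 is held within a single class, which is the pattern a positive net correlation between X1 and X2 would show.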

MULTIPLE LINEAR REGRESSION 

The a's, b's, and β's of a linear plane of regression are 
calculated in terms of given data or of quantities easily calculated 
from the data. The β's can be evaluated in terms of the simple 
correlation coefficients, the r's; knowledge of these therefore 
permits the immediate calculation of the former. The b's can 
be computed readily from the β's by multiplying by the proper 
ratio of standard deviations [see Eqs. (4)]. Finally, the a's can 
be computed from the b's and the means of the different variables. 

The common method of evaluating the β's is the method of 
least squares. It was pointed out that for three variables there 
are three planes of regression. Values for the regression 
statistics in the regression equation of X1 on X2 and X3 are derived 
by minimizing the sum of the squares of the deviations of the 
actual values of X1 from those (X1') given by the plane of 
regression, that is, by minimizing Σ(X1 - X1')². Similarly, values for 
the regression statistics in the regression equation of X2 on X1 
and X3 are derived by minimizing the sum of the squares of the 
deviations of X2 from those (X2') given by the second plane of 
regression, that is, by minimizing Σ(X2 - X2')². Finally, the 
values of the regression statistics in the regression equation of 
X3 on X1 and X2 are derived by minimizing Σ(X3 - X3')². All 
three planes of regression are thus fitted by the method of least 
squares, but in each case the sum of the squares of a different 
set of deviations is minimized. 

Using the third form of regression equation, the values of the 
statistics for the plane of regression of X1 on X2 and X3 are 
derived as follows: 
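In the equation for the plane of regression, 

$$\frac{x_1'}{\sigma_1} = \beta_{12.3}\frac{x_2}{\sigma_2} + \beta_{13.2}\frac{x_3}{\sigma_3}$$

the problem is to determine β12.3 and β13.2 such that 

$$\sum\left(\frac{x_1}{\sigma_1} - \frac{x_1'}{\sigma_1}\right)^2 = \sum\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right)^2 = \text{minimum} \tag{5}$$

Since the first sum in Eq. (5) is merely 

$$\frac{1}{\sigma_1^2}\sum\left[(X_1 - \bar{X}_1) - (X_1' - \bar{X}_1)\right]^2 = \frac{1}{\sigma_1^2}\sum (X_1 - X_1')^2$$

it follows that Σ(X1 - X1')² will be a minimum when 

$$\sum\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right)^2$$

is a minimum, and hence the plane of regression derived by 
minimizing the latter is the same as that derived by minimizing 
the former. 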

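If 

$$\sum\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right)^2$$

is to be a minimum, the derivative of this sum with respect to 
β12.3 must equal zero and also its derivative with respect to 
β13.2 must equal zero. These conditions are expressed in the 
following equations: 

$$
\begin{aligned}
\sum \frac{x_2}{\sigma_2}\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right) &= 0 \\
\sum \frac{x_3}{\sigma_3}\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right) &= 0
\end{aligned}
\tag{6}
$$

If in these equations the indicated multiplication is carried out 
and if each equation is divided by N, they become 

$$
\begin{aligned}
\frac{\sum x_1 x_2}{N\sigma_1\sigma_2} - \beta_{12.3}\frac{\sum x_2^2}{N\sigma_2^2} - \beta_{13.2}\frac{\sum x_3 x_2}{N\sigma_3\sigma_2} &= 0 \\
\frac{\sum x_1 x_3}{N\sigma_1\sigma_3} - \beta_{12.3}\frac{\sum x_2 x_3}{N\sigma_2\sigma_3} - \beta_{13.2}\frac{\sum x_3^2}{N\sigma_3^2} &= 0
\end{aligned}
\tag{7}
$$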
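But it will be noted that 

$$\frac{\sum x_1 x_2}{N\sigma_1\sigma_2} = r_{12}, \qquad \frac{\sum x_1 x_3}{N\sigma_1\sigma_3} = r_{13}, \qquad \frac{\sum x_2 x_3}{N\sigma_2\sigma_3} = r_{23}, \qquad \frac{\sum x_2^2}{N\sigma_2^2} = \frac{\sum x_3^2}{N\sigma_3^2} = 1$$

Hence Eqs. (7) reduce to the following: 

$$
\begin{aligned}
r_{12} - \beta_{12.3} - \beta_{13.2}r_{23} &= 0 \\
r_{13} - \beta_{12.3}r_{23} - \beta_{13.2} &= 0
\end{aligned}
\tag{8}
$$

When solved by the ordinary method of simultaneous 
equations, 

$$
\beta_{12.3} = \frac{r_{12} - r_{13}r_{23}}{1 - r_{23}^2}, \qquad
\beta_{13.2} = \frac{r_{13} - r_{12}r_{23}}{1 - r_{23}^2}
\tag{9}
$$

From Eqs. (9) it will be noted that, when r23 = 0, β12.3 = r12 
and β13.2 = r13.* 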
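
As a quick numerical check of this solution, the sketch below substitutes the values given by Eqs. (9) back into Eqs. (8). The function name `betas_for_x1` and the three correlation values are hypothetical, chosen only for illustration.

```python
def betas_for_x1(r12, r13, r23):
    """Return (beta12.3, beta13.2) from the simple correlations, per Eqs. (9)."""
    denom = 1 - r23 ** 2
    return (r12 - r13 * r23) / denom, (r13 - r12 * r23) / denom


# Hypothetical simple correlations.
r12, r13, r23 = 0.8, 0.6, 0.5
beta12_3, beta13_2 = betas_for_x1(r12, r13, r23)

# Left-hand sides of Eqs. (8); both should vanish.
eq8_first = r12 - beta12_3 - beta13_2 * r23
eq8_second = r13 - beta12_3 * r23 - beta13_2
print(beta12_3, beta13_2, eq8_first, eq8_second)

# Special case noted in the text: with r23 = 0 the betas reduce to r12 and r13.
special = betas_for_x1(0.5, 0.3, 0.0)
print(special)
```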

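If the other planes of regression are put in the form 

$$
\begin{aligned}
\frac{x_2'}{\sigma_2} &= \beta_{21.3}\frac{x_1}{\sigma_1} + \beta_{23.1}\frac{x_3}{\sigma_3} \\
\frac{x_3'}{\sigma_3} &= \beta_{31.2}\frac{x_1}{\sigma_1} + \beta_{32.1}\frac{x_2}{\sigma_2}
\end{aligned}
$$

and the values of the β's are determined in the same manner as 
the values of β12.3 and β13.2 were determined, the following results 
are obtained: 

$$
\begin{aligned}
\beta_{21.3} &= \frac{r_{12} - r_{13}r_{23}}{1 - r_{13}^2} & \beta_{23.1} &= \frac{r_{23} - r_{12}r_{13}}{1 - r_{13}^2} \\
\beta_{31.2} &= \frac{r_{13} - r_{12}r_{23}}{1 - r_{12}^2} & \beta_{32.1} &= \frac{r_{23} - r_{12}r_{13}}{1 - r_{12}^2}
\end{aligned}
$$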

* See pp. 417 and 421. 

If the simple linear-correlation coefficients are known, 
therefore, it is possible to obtain all the β's that enter into the three 
multiple-regression equations; and the regression equations in the 
β form are thus determined. The other forms of the regression 
equations can be derived from the β form by calculating the 
b's from the β's, and the a's from the b's and the means. For 
example, Eqs. (4), such as b12.3 = β12.3(σ1/σ2), will give the values of 
the b's. The regression equations in the form 

$$x_1' = b_{12.3}x_2 + b_{13.2}x_3$$

are then determined. If, in this latter form, X1' - X̄1 is 
substituted for x1', X2 - X̄2 for x2, and X3 - X̄3 for x3, the equation 
becomes 

$$X_1' = \bar{X}_1 - b_{12.3}\bar{X}_2 - b_{13.2}\bar{X}_3 + b_{12.3}X_2 + b_{13.2}X_3$$

from which it may be seen that the value of a1.23 is as follows: 

$$a_{1.23} = \bar{X}_1 - b_{12.3}\bar{X}_2 - b_{13.2}\bar{X}_3 \tag{10}$$

Similarly, the value of a for the other regression equations is 
found to be as follows: 

$$a_{2.13} = \bar{X}_2 - b_{21.3}\bar{X}_1 - b_{23.1}\bar{X}_3 \tag{10$'$}$$

$$a_{3.12} = \bar{X}_3 - b_{31.2}\bar{X}_1 - b_{32.1}\bar{X}_2 \tag{10$''$}$$

It is helpful in the use of these equations to remember the 
symmetry in the notation, that is, the symmetry in the position 
of the subscripts. 
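
The whole chain from the r's to the β's, the b's, and the a can be traced on a small data set. The sketch below is illustrative only: the eight observations are hypothetical, the helper names are invented, and standard deviations use the divisor N as in the text. It computes the plane of regression of X1 on X2 and X3 and then confirms the least-squares property by checking that the residuals sum to zero and are uncorrelated with X2 and X3, which is what the conditions of Eqs. (6) require.

```python
import math

# Hypothetical observations (not taken from the text).
X1 = [10.0, 12.0, 15.0, 11.0, 18.0, 20.0, 16.0, 14.0]
X2 = [2.0, 3.0, 4.0, 2.5, 5.0, 6.0, 4.5, 3.5]
X3 = [7.0, 6.0, 9.0, 8.0, 9.5, 11.0, 8.0, 9.0]


def mean(v):
    return sum(v) / len(v)


def sd(v):
    """Standard deviation with divisor N."""
    m = mean(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / len(v))


def corr(u, v):
    """Simple (Pearsonian) correlation coefficient."""
    mu, mv = mean(u), mean(v)
    cross = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cross / (len(u) * sd(u) * sd(v))


m1, m2, m3 = mean(X1), mean(X2), mean(X3)
s1, s2, s3 = sd(X1), sd(X2), sd(X3)
r12, r13, r23 = corr(X1, X2), corr(X1, X3), corr(X2, X3)

# Eqs. (9): betas from the simple correlations.
beta12_3 = (r12 - r13 * r23) / (1 - r23 ** 2)
beta13_2 = (r13 - r12 * r23) / (1 - r23 ** 2)
# Eqs. (4): b's from the betas.
b12_3 = beta12_3 * s1 / s2
b13_2 = beta13_2 * s1 / s3
# Eq. (10): constant term from the b's and the means.
a1_23 = m1 - b12_3 * m2 - b13_2 * m3

estimates = [a1_23 + b12_3 * x2 + b13_2 * x3 for x2, x3 in zip(X2, X3)]
residuals = [x1 - e for x1, e in zip(X1, estimates)]

# Least-squares conditions: residuals sum to zero and are orthogonal
# to the deviations of X2 and X3 from their means.
check_sum = sum(residuals)
check_x2 = sum(d * (x2 - m2) for d, x2 in zip(residuals, X2))
check_x3 = sum(d * (x3 - m3) for d, x3 in zip(residuals, X3))
print(a1_23, b12_3, b13_2)
print(check_sum, check_x2, check_x3)
```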

Second-order Variances for Linear Plane of Regression. The 
formulas derived below measure the dispersion of the individual 
items about the plane of regression fitted by the method of 
least squares. As in the simpler case of the line of regression, so 
also for the plane of regression, the mathematical procedure 
consists in finding the standard deviation of the deviations of actual 
X values from the estimated values (X') given by the planes of 
regression. For example, by definition, σ²1.23 = Σ(X1 - X1')²/N. 
The task reduces to one of evaluating such expressions in terms 
of quantities already known, that is, the r's and the β's. This 
can be done as follows: 

Since it has been found easier to work with the variables when 
they are converted into deviations from their respective means 
and expressed in terms of their standard deviations, the formula 
for σ²1.23 will first be put in that form. This can be done by 
subtracting X̄1 from both X1 and X1' and by multiplying both 
numerator and denominator by σ1², neither of which will affect 
the value of the expression. Thus, 

$$
\sigma_{1.23}^2 = \frac{\sum (X_1 - X_1')^2}{N}
= \frac{\sigma_1^2 \sum\left[(X_1 - \bar{X}_1) - (X_1' - \bar{X}_1)\right]^2}{N\sigma_1^2}
= \frac{\sigma_1^2 \sum\left(\dfrac{x_1}{\sigma_1} - \dfrac{x_1'}{\sigma_1}\right)^2}{N}
= \sigma_1^2\,\frac{\sum d^2}{N}
\tag{11}
$$

where 

$$d = \frac{x_1}{\sigma_1} - \frac{x_1'}{\sigma_1}$$

The problem is to evaluate Σd². 

By Eqs. (3), the third form of the regression equation of X1 on 
X2 and X3, it follows that 

$$d = \frac{x_1}{\sigma_1} - \frac{x_1'}{\sigma_1} = \frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3} \tag{12}$$

Accordingly, for any given set of values of X1, X2, and X3, there 
corresponds a particular value for d, which is the deviation, in 
standard-deviation units, of the actual value of x1 from the value 
of x1' obtained by putting the given values of X2 and X3 in the 
regression equation. There are just as many d's, therefore, as 
there are different sets of values of X1, X2, and X3. If any one 
of these d's is squared, Eq. (12) gives 

$$d^2 = \frac{x_1}{\sigma_1}d - \beta_{12.3}\frac{x_2}{\sigma_2}d - \beta_{13.2}\frac{x_3}{\sigma_3}d \tag{13}$$

If all values of d are squared and summed, the following 
equation results: 

$$\sum d^2 = \frac{\sum x_1 d}{\sigma_1} - \beta_{12.3}\frac{\sum x_2 d}{\sigma_2} - \beta_{13.2}\frac{\sum x_3 d}{\sigma_3} \tag{14}$$

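But from Eqs. (6) and (12), it will be noted that 

$$\frac{\sum x_2 d}{\sigma_2} = \sum \frac{x_2}{\sigma_2}\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right) = 0$$

and 

$$\frac{\sum x_3 d}{\sigma_3} = \sum \frac{x_3}{\sigma_3}\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right) = 0$$

Therefore, 

$$\sum d^2 = \frac{\sum x_1 d}{\sigma_1} \tag{15}$$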


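The evaluation, therefore, of Σx1d/σ1 will solve the problem. 
This can be done as follows: 

If each d, as shown in Eq. (12), is multiplied by the x1/σ1 to 
which that d belongs, the following result is obtained: 

$$\frac{x_1 d}{\sigma_1} = \frac{x_1^2}{\sigma_1^2} - \beta_{12.3}\frac{x_1 x_2}{\sigma_1\sigma_2} - \beta_{13.2}\frac{x_1 x_3}{\sigma_1\sigma_3} \tag{16}$$

Values of x1d/σ1 for all values of d and x1 sum up as follows: 

$$\frac{\sum x_1 d}{\sigma_1} = \frac{\sum x_1^2}{\sigma_1^2} - \beta_{12.3}\frac{\sum x_1 x_2}{\sigma_1\sigma_2} - \beta_{13.2}\frac{\sum x_1 x_3}{\sigma_1\sigma_3} \tag{17}$$

Hence, dividing by N, 

$$\frac{\sum d^2}{N} = \frac{\sum x_1 d}{N\sigma_1} = \frac{\sum x_1^2}{N\sigma_1^2} - \beta_{12.3}\frac{\sum x_1 x_2}{N\sigma_1\sigma_2} - \beta_{13.2}\frac{\sum x_1 x_3}{N\sigma_1\sigma_3}$$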

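But since 

$$\frac{\sum x_1^2}{N\sigma_1^2} = 1, \qquad \frac{\sum x_1 x_2}{N\sigma_1\sigma_2} = r_{12}, \qquad \frac{\sum x_1 x_3}{N\sigma_1\sigma_3} = r_{13}$$

it follows that 

$$\frac{\sum d^2}{N} = 1 - \beta_{12.3}r_{12} - \beta_{13.2}r_{13} \tag{18}$$

and, finally, from Eqs. (11) and (18), 

$$\sigma_{1.23}^2 = \sigma_1^2\,\frac{\sum d^2}{N} = \sigma_1^2\left(1 - \beta_{12.3}r_{12} - \beta_{13.2}r_{13}\right) \tag{19}$$

This gives an easy method for evaluating σ²1.23 when the r's 
and β's have been calculated. Similar formulas for evaluating 
σ²2.13 and σ²3.12, which measure the scatter about the other planes 
of regression, are found to be as follows:¹ 

$$\sigma_{2.13}^2 = \sigma_2^2\left(1 - \beta_{21.3}r_{12} - \beta_{23.1}r_{23}\right) \tag{19$'$}$$

$$\sigma_{3.12}^2 = \sigma_3^2\left(1 - \beta_{31.2}r_{13} - \beta_{32.1}r_{23}\right) \tag{19$''$}$$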

Note the symmetry of these three equations. 
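
Equation (19) can be compared numerically with the product form given in the chapter's footnote, in which the same variance is written in terms of r12 and the partial-correlation coefficient r13.2. The following is a sketch only: the correlations and the value of σ1 are hypothetical, and the partial-correlation formula used is the standard first-order one.

```python
import math

# Hypothetical inputs.
r12, r13, r23 = 0.8, 0.6, 0.5
sigma1 = 4.0

beta12_3 = (r12 - r13 * r23) / (1 - r23 ** 2)
beta13_2 = (r13 - r12 * r23) / (1 - r23 ** 2)

# Eq. (19): variance about the plane of regression of X1 on X2 and X3.
var_eq19 = sigma1 ** 2 * (1 - beta12_3 * r12 - beta13_2 * r13)

# Footnote form: sigma1^2 (1 - r12^2)(1 - r13.2^2).
r13_2 = (r13 - r12 * r23) / math.sqrt((1 - r12 ** 2) * (1 - r23 ** 2))
var_product = sigma1 ** 2 * (1 - r12 ** 2) * (1 - r13_2 ** 2)

print(var_eq19, var_product)  # the two forms agree
```

The scatter about the plane is smaller than σ1² whenever the independent variables are of any use in estimating X1, which is what the factor in parentheses in Eq. (19) expresses.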

COEFFICIENT OF MULTIPLE CORRELATION 


The multiple-correlation coefficient measures the correlation 
between the dependent variable and the two independent 
variables taken together. For reasons previously indicated,² 

$$R_{1.23}^2 = 1 - \frac{\sigma_{1.23}^2}{\sigma_1^2}$$

may be taken as a good measure of multiple correlation. 

¹ The dispersion σ²1.23 may also be calculated from the formulas: 

$$\sigma_{1.23}^2 = \sigma_{1.2}^2\left(1 - r_{13.2}^2\right) \qquad\text{and}\qquad \sigma_{1.23}^2 = \sigma_1^2\left(1 - r_{12}^2\right)\left(1 - r_{13.2}^2\right)$$

This may be demonstrated as follows: From Eq. (23'), 

$$r_{13.2}^2 = \frac{r_{13}^2 + r_{12}^2 r_{23}^2 - 2r_{13}r_{12}r_{23}}{(1 - r_{12}^2)(1 - r_{23}^2)}$$

and 

$$1 - r_{13.2}^2 = \frac{1 - r_{12}^2 - r_{23}^2 + r_{12}^2 r_{23}^2 - r_{13}^2 - r_{12}^2 r_{23}^2 + 2r_{13}r_{12}r_{23}}{(1 - r_{12}^2)(1 - r_{23}^2)}$$

This gives 

$$
\begin{aligned}
(1 - r_{12}^2)(1 - r_{13.2}^2) &= \frac{(1 - r_{23}^2) - (r_{12}^2 + r_{13}^2 - 2r_{13}r_{12}r_{23})}{1 - r_{23}^2} \\
&= 1 - r_{12}\,\frac{r_{12} - r_{13}r_{23}}{1 - r_{23}^2} - r_{13}\,\frac{r_{13} - r_{12}r_{23}}{1 - r_{23}^2}
\end{aligned}
$$

Equation (9), however, shows that the two fractions on the right are equal 
respectively to β12.3 and β13.2. Hence 

$$(1 - r_{12}^2)(1 - r_{13.2}^2) = 1 - r_{12}\beta_{12.3} - r_{13}\beta_{13.2}$$

or on making use of Eq. (19), 

$$\sigma_{1.23}^2 = \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2)$$

In Chap. XIII (p. 321) it was shown that σ²1.2 = σ1²(1 - r12²). Hence the 
last equation may be written 

$$\sigma_{1.23}^2 = \sigma_{1.2}^2(1 - r_{13.2}^2)$$

Thus both of the original formulas are derived from previously 
demonstrated relationships. Similar formulas hold for σ²2.13 and σ²3.12. These are 

$$\sigma_{2.13}^2 = \sigma_{2.1}^2(1 - r_{23.1}^2) = \sigma_2^2(1 - r_{12}^2)(1 - r_{23.1}^2)$$

and 

$$\sigma_{3.12}^2 = \sigma_{3.1}^2(1 - r_{32.1}^2) = \sigma_3^2(1 - r_{13}^2)(1 - r_{32.1}^2)$$

² See pp. 351-353 and 365-368. 

R1.23 measures the degree of association between the X1 
variable and X2 and X3 taken jointly. It can also be looked 
upon as a measure of the goodness of fit of the plane of regression 
of X1 on X2 and X3 to the set of X1 values. For if the fit is 
perfect, σ1.23 will be zero and hence R1.23 will equal 1. 
Similarly, R2.13 measures the degree of association between the X2 
variable and X1 and X3 taken jointly, and R3.12 measures the 
association between the X3 variable and X1 and X2 taken jointly. 
They also can be looked upon as measures of goodness of fit of 
their respective planes of regression. It will be recalled that all 
three of these multiple-correlation coefficients may have 
different values. 
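
That the three coefficients generally differ can be seen from a small calculation. The sketch below is illustrative only: the simple correlations are hypothetical and the function name `multiple_r` is invented. It uses the relation R² = 1 - σ²(plane)/σ²(variable) together with Eqs. (19), (19'), and (19''), which reduces to a sum of β times r for each plane.

```python
import math


def multiple_r(r_dep_a, r_dep_b, r_ab):
    """Multiple correlation of a dependent variable with independents a and b.

    r_dep_a, r_dep_b: simple correlations of the dependent variable with a and b.
    r_ab: simple correlation between the two independent variables.
    """
    beta_a = (r_dep_a - r_dep_b * r_ab) / (1 - r_ab ** 2)
    beta_b = (r_dep_b - r_dep_a * r_ab) / (1 - r_ab ** 2)
    return math.sqrt(beta_a * r_dep_a + beta_b * r_dep_b)


# Hypothetical simple correlations.
r12, r13, r23 = 0.8, 0.6, 0.5

R1_23 = multiple_r(r12, r13, r23)  # X1 dependent; X2 and X3 independent
R2_13 = multiple_r(r12, r23, r13)  # X2 dependent; X1 and X3 independent
R3_12 = multiple_r(r13, r23, r12)  # X3 dependent; X1 and X2 independent

print(R1_23, R2_13, R3_12)  # three different values
```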

The multiple coefficient of correlation R1.23 is always larger 
than or at least equal to r12 and r13 in absolute value; for it stands 
to reason that X1 can be estimated better (or at least no more 
poorly) from two variables X2 and X3 than from X2 alone or X3 
alone. Similarly, R2.13 is greater than, or at least equal to, r12 
and r23; and R3.12 is greater than, or at least equal to, r13 and r23. 
Furthermore, R²1.23 is equal to the sum of r12² and r13² if X2 and 
X3 are independent of each other; for, by Eq. (19), it follows that 

$$R_{1.23}^2 = 1 - \frac{\sigma_{1.23}^2}{\sigma_1^2} = 1 - \frac{\sigma_1^2\left(1 - \beta_{12.3}r_{12} - \beta_{13.2}r_{13}\right)}{\sigma_1^2} = \beta_{12.3}r_{12} + \beta_{13.2}r_{13} \tag{20}$$

If X2 and X3 are independent of each other, r23 = 0; and, by 
Eqs. (9), β12.3 = r12 and β13.2 = r13. Accordingly, if r23 = 0, 

$$R_{1.23}^2 = r_{12}^2 + r_{13}^2 \tag{21}$$

Similarly, if r13 = 0, 

$$R_{2.13}^2 = r_{12}^2 + r_{23}^2 \tag{21$'$}$$

and if r12 = 0, 

$$R_{3.12}^2 = r_{13}^2 + r_{23}^2 \tag{21$''$}$$

Consequently, by adding to the regression equation a second
variable that is independent of the first, the accuracy with which
the dependent variable can be estimated, as measured by $R^2$, is
increased by the square of the correlation between the dependent
variable and the newly added variable.
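The additive relation of Eqs. (21) can be checked numerically. The sketch below is an illustration only and is not part of the original text. It assumes the usual three-variable expressions $\beta_{12.3} = (r_{12} - r_{13}r_{23})/(1 - r_{23}^2)$ and $\beta_{13.2} = (r_{13} - r_{12}r_{23})/(1 - r_{23}^2)$, taken here to be what Eq. (9) states; the function name and the sample correlations are hypothetical.

```python
def multiple_r_squared(r12, r13, r23):
    """R^2 of X1 on X2 and X3 by Eq. (20): beta12.3*r12 + beta13.2*r13."""
    denom = 1.0 - r23 ** 2
    beta12_3 = (r12 - r13 * r23) / denom  # assumed form of Eq. (9)
    beta13_2 = (r13 - r12 * r23) / denom
    return beta12_3 * r12 + beta13_2 * r13


# Independent predictors (r23 = 0): R^2 is the sum of the squared simple r's.
r2_indep = multiple_r_squared(0.6, 0.5, 0.0)

# Correlated predictors (r23 = 0.4): R^2 falls short of that simple sum,
# but it is still at least as large as the larger squared simple r.
r2_corr = multiple_r_squared(0.6, 0.5, 0.4)

print(r2_indep, r2_corr)
```

With these sample values the first result equals $0.6^2 + 0.5^2$, while the second is smaller because part of what $X_3$ contributes is already carried by $X_2$.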

It should be noted that only in special instances can a definite 
sign be given the multiple-correlation coefficient, although it is 
usually assumed to be inherently positive. For, as was indicated 
above, it may happen that the plane of regression to which a
given multiple-correlation coefficient pertains may slope upward 
in one direction and downward in another direction, indicating 
a positive relationship between the dependent variable and one 
independent variable and a negative relationship between the 
dependent variable and the other independent variable. In 
such an instance, the correlation between the dependent variable 
and the two independent variables taken jointly, that is, the 
multiple correlation, cannot be said to be either positive or 
negative. For such a multiple-correlation coefficient, no sign is 
attached. It is only when the dependent variable is positively 
or negatively correlated with each and every one of the inde- 
pendent variables that the multiple-correlation coefficient can be 
given a positive or negative sign. 

COEFFICIENT OF PARTIAL CORRELATION 

In the preceding sections of this chapter the discussion has 
centered on the problem of estimating the value of one variable 
from one or more other variables by means of a regression equa- 
tion. In connection with this problem, a coefficient measuring 
the degree of association between the dependent variable and the 
independent variables as a group was evaluated to show the 
accuracy with which such estimates can be made. This 
coefficient is a measure of the goodness of fit of the plane of 
regression. 

When there are interrelationships among three or more 
variables, another problem appears. It often happens that an 
apparent relationship between two variables is in reality the 
result of their individual connection with a third variable that 
commonly affects them both. For example, it may be that the 
correlation between the price of cotton and the price of rayon 
is due largely to the correlation of each of them with the index 
of wholesale prices. In other words, the concomitant movements
in the prices of cotton and rayon may be due fundamentally
not to any direct relationship between these two
competing commodities, but primarily to their common tendency
to rise and fall with the general price level; they may be
joint effects of a common cause. Similarly, the concomitant 
variations in first- and second-semester English grades of 
freshmen in a woman’s college may be basically accounted for by 
their respective relationships to the grades attained by the same 
freshmen in verbal scholastic-aptitude tests or to their school 
records. 

The statistical device for discovering how much correlation
there is between one variable and another variable when a third
variable or a number of other variables are "held constant" is
the method of "partial correlation." The correlation between
the freshmen grades in second-semester English, $X_1$, and the
freshmen grades in first-semester English, $X_2$, when the grades
of the respective freshmen in verbal scholastic-aptitude tests, $X_3$,
are held constant is the partial correlation between $X_1$ and $X_2$.
Such a partial correlation coefficient will show how much connection
there is between grades in first- and second-semester
English independent of their common connection with grades
in verbal scholastic-aptitude tests. The coefficient of partial
correlation, indicated in this instance as $r_{12.3}$, will measure the
degree of this independent association.¹

A variable is, of course, not held constant in any physical 
sense. It is not possible in any way ex post facto to change the 
fact that a Mount Holyoke freshman, who had grades of 160 
in first-semester English and 160 in second-semester English, 
had also a grade of 437 in her verbal scholastic-aptitude test; 
nor is it possible to change the fact that another Mount Holyoke 
freshman, who had grades of 120 in first-semester English and 
160 in second-semester English, had also a grade of 384 in her 
verbal scholastic-aptitude test. The ideal of holding constant

¹ The position of the point in the subscripts of $r_{12.3}$, rather than the fact
that it is a smaller letter, distinguishes it from $R_{1.23}$. In the latter, the
point comes after the first digit, setting off the two independent variables
$X_2$ and $X_3$, jointly associated with the dependent variable $X_1$. In the
coefficient of partial correlation, the point sets off the variable that is held
constant, coming immediately after the pair that are correlated, $X_1$ and
$X_2$. Thus, in $r_{12.3}$, $X_3$ is held constant while $X_1$ and $X_2$ are correlated;
in $r_{12.345}$, $X_3$, $X_4$, and $X_5$ are held constant while $X_1$ and $X_2$ are correlated.
The symbol $R_{1.2345}$, by the position of the point, indicates a multiple-correlation
coefficient between $X_1$, dependent, and $X_2$, $X_3$, $X_4$, and $X_5$ taken
jointly as the independent variables.
one of the three variables is wholly a statistical idea. It consists
in eliminating from each of the two variables between which
the partial correlation is sought the effect of the third variable.
More specifically, the line of regression of $X_1$ on $X_3$ is found, and
the deviations of the actual values of $X_1$ from those given by the
line of regression $X_1' = a_{1.3} + b_{13}X_3$ are determined. These
deviations from the line of regression represent the variation in
$X_1$ that is left over after the linear effect of $X_3$ is eliminated.
Similarly, the line of regression of $X_2$ on $X_3$ is computed, and
the deviations of the actual values of $X_2$ from those given by the
line of regression $X_2' = a_{2.3} + b_{23}X_3$ are determined. These
deviations from the line of regression represent the variation in
$X_2$ that is left over after the linear effect of $X_3$ is eliminated.
When these residual deviations in $X_1$ and $X_2$ are correlated, the
result is the partial-correlation coefficient between $X_1$ and $X_2$
when $X_3$ is held constant, because the effect of $X_3$ upon each of
them has been eliminated.
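The procedure just described can be followed literally on a small set of figures. The sketch below is an illustration under stated assumptions: the scores are hypothetical (only the first two rows echo the two freshmen mentioned above), and the helper names `correlation`, `residuals`, and `partial_correlation_by_residuals` are invented for this example. It fits the two simple lines of regression on $X_3$, takes the deviations from each, and correlates the deviations.

```python
import math


def mean(values):
    return sum(values) / len(values)


def correlation(xs, ys):
    """Simple (zero-order) correlation coefficient between two lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def residuals(ys, xs):
    """Deviations of ys from the least-squares line of regression of ys on xs."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


def partial_correlation_by_residuals(x1, x2, x3):
    """Correlate what is left of x1 and x2 after the linear effect of x3 is removed."""
    return correlation(residuals(x1, x3), residuals(x2, x3))


# Hypothetical scores: x1 = second-semester English, x2 = first-semester
# English, x3 = verbal scholastic-aptitude test.
x1 = [160, 160, 145, 150, 130, 170, 155, 140]
x2 = [160, 120, 150, 140, 125, 165, 150, 145]
x3 = [437, 384, 410, 420, 370, 460, 430, 400]

r12_3 = partial_correlation_by_residuals(x1, x2, x3)
print(r12_3)
```

No variable is physically altered here; "holding $X_3$ constant" amounts to nothing more than working with the two sets of residuals.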

To calculate a partial coefficient of correlation, the extended
calculations involved in computing two lines of regression and
measuring the deviations of the actual values from them are not
necessary. The coefficient of partial correlation can be algebraically
evaluated in a formula that makes it possible to compute
it from the coefficients of simple linear correlation, as follows:
The deviation of $X_1$ from the line of regression of $X_1$ on $X_3$
may be written as $x_1 - x_1'$ or $x_1 - r_{13}\frac{\sigma_1}{\sigma_3}x_3$, where the $x$'s are
measured from their respective means. Similarly, the deviation
of $X_2$ from the line of regression of $X_2$ on $X_3$ may be written as
$x_2 - x_2'$ or $x_2 - r_{23}\frac{\sigma_2}{\sigma_3}x_3$. The standard deviations of these
deviations from the lines of regression have already been determined
to be $\sigma_{1.3}$ and $\sigma_{2.3}$, respectively. In accordance with the
ordinary formula for a simple correlation coefficient, that is,
$r = \Sigma x_1x_2/N\sigma_1\sigma_2$, the partial-correlation coefficient between $X_1$
and $X_2$, when $X_3$ is held constant, is, by definition,

$$r_{12.3} = \frac{\Sigma(x_1 - x_1')(x_2 - x_2')}{N\sigma_{1.3}\sigma_{2.3}} \tag{22}$$
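If the numerator is expanded and the values for $\sigma_{1.3}$ and $\sigma_{2.3}$
are substituted in the denominator, this becomes

$$r_{12.3} = \frac{\Sigma x_1x_2 - r_{23}\frac{\sigma_2}{\sigma_3}\Sigma x_1x_3 - r_{13}\frac{\sigma_1}{\sigma_3}\Sigma x_2x_3 + r_{13}r_{23}\frac{\sigma_1\sigma_2}{\sigma_3^2}\Sigma x_3^2}{N\sigma_1\sqrt{1 - r_{13}^2}\;\sigma_2\sqrt{1 - r_{23}^2}}$$

Upon transferring the divisor $N\sigma_1\sigma_2$ from the denominator to
each term of the numerator, this becomes

$$r_{12.3} = \frac{\dfrac{\Sigma x_1x_2}{N\sigma_1\sigma_2} - r_{23}\dfrac{\Sigma x_1x_3}{N\sigma_1\sigma_3} - r_{13}\dfrac{\Sigma x_2x_3}{N\sigma_2\sigma_3} + r_{13}r_{23}\dfrac{\Sigma x_3^2}{N\sigma_3^2}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$$

in which $r_{12}$, $r_{13}$, $r_{23}$ can be substituted for their respective equivalent
values, making the formula appear as follows:

$$r_{12.3} = \frac{r_{12} - r_{23}r_{13} - r_{13}r_{23} + r_{13}r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$$

which reduces to

$$r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} \tag{23}$$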


Similar formulas for the partial correlation between $X_1$ and
$X_3$ when $X_2$ is held constant and the partial correlation between
$X_2$ and $X_3$ when $X_1$ is held constant are as follows:¹

$$r_{13.2} = \frac{r_{13} - r_{12}r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}} \tag{23'}$$

$$r_{23.1} = \frac{r_{23} - r_{12}r_{13}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13}^2}} \tag{23''}$$

From Eq. (23) it can be seen that if $X_1$ and $X_2$ are both uncorrelated
with $X_3$, that is, if $r_{13}$ and $r_{23}$ are zero, then $r_{12.3} = r_{12}$.
Similarly, if $r_{12}$ and $r_{23}$ are zero, $r_{13.2} = r_{13}$; and if $r_{12}$ and $r_{13}$
are zero, $r_{23.1} = r_{23}$.
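Equation (23) is short enough to state as a function. The following sketch is illustrative only; the function name `partial_r` and the sample correlations are hypothetical. By permuting the arguments, the same function also gives Eqs. (23') and (23'').

```python
import math


def partial_r(r12, r13, r23):
    """First-order partial correlation r12.3 by Eq. (23)."""
    return (r12 - r13 * r23) / (math.sqrt(1 - r13 ** 2) * math.sqrt(1 - r23 ** 2))


# A sizable simple correlation that is largely accounted for by X3:
# r12 = 0.70, but both variables correlate 0.80 with X3.
mostly_spurious = partial_r(0.70, 0.80, 0.80)

# X3 uncorrelated with both X1 and X2: the partial equals the simple r12.
unchanged = partial_r(0.70, 0.0, 0.0)

# Eq. (23'): r13.2 is obtained by interchanging the roles of X2 and X3.
r13_2 = partial_r(0.80, 0.70, 0.80)

print(mostly_spurious, unchanged, r13_2)
```

In the first call the simple correlation of 0.70 shrinks to about 0.17 once $X_3$ is held constant, which is the situation described for the prices of cotton and rayon.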


¹ For the coefficient of partial correlation, the order of the numbers in the
subscripts either before or after the point is a matter of indifference; that
is, $r_{12.3} = r_{21.3}$ and $r_{12.345} = r_{21.453}$, etc. It will be remembered that this is
not true with respect to the order of the numbers in the subscripts before
the point in the $b$'s and the $\beta$'s; that is, $b_{12.3} \neq b_{21.3}$, $\beta_{12.3} \neq \beta_{21.3}$.
Any one of the following formulas, which can be readily
derived algebraically [see Eq. (9)], may be used in place of, or as
a check on, Eq. (23):

$$r_{12.3} = \sqrt{\beta_{12.3}\,\beta_{21.3}} \tag{24}$$

$$r_{12.3} = \sqrt{b_{12.3}\,b_{21.3}} \tag{25}$$

$$r_{12.3} = \beta_{12.3}\,\frac{\sigma_1\sigma_{2.3}}{\sigma_2\sigma_{1.3}} \tag{26}$$

$$r_{12.3} = b_{12.3}\,\frac{\sigma_{2.3}}{\sigma_{1.3}} \tag{27}$$

Thus the partial-correlation coefficient can be calculated directly
from the $\beta$'s or from the $b$'s and the dispersion formulas for the
simple lines of regression, as well as from the simple $r$'s. The
equations for the other coefficients of partial correlation are
symmetrical with Eqs. (24) to (27).
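As a check on the arithmetic, the alternative routes can be compared on one set of figures. The sketch below is an illustration with hypothetical correlations and standard deviations. It assumes the usual three-variable forms $\beta_{12.3} = (r_{12} - r_{13}r_{23})/(1 - r_{23}^2)$ and $\beta_{21.3} = (r_{12} - r_{13}r_{23})/(1 - r_{13}^2)$ for Eq. (9), and it gives the square root in Eq. (24) the sign of $\beta_{12.3}$, which is an added convention rather than something stated in the formula itself.

```python
import math

r12, r13, r23 = 0.6, 0.5, 0.4   # hypothetical zero-order correlations
s1, s2 = 10.0, 4.0              # hypothetical standard deviations of X1 and X2

beta12_3 = (r12 - r13 * r23) / (1 - r23 ** 2)   # assumed form of Eq. (9)
beta21_3 = (r12 - r13 * r23) / (1 - r13 ** 2)
b12_3 = beta12_3 * s1 / s2                      # net regression in original units
s1_3 = s1 * math.sqrt(1 - r13 ** 2)             # scatter about the line of X1 on X3
s2_3 = s2 * math.sqrt(1 - r23 ** 2)             # scatter about the line of X2 on X3

eq23 = (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
eq24 = math.copysign(math.sqrt(beta12_3 * beta21_3), beta12_3)
eq26 = beta12_3 * (s1 * s2_3) / (s2 * s1_3)
eq27 = b12_3 * s2_3 / s1_3

print(eq23, eq24, eq26, eq27)
```

All four values agree, so any one of them can serve as a check on the others.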

The partial coefficients of correlation illustrated above are
called correlation coefficients of the "first order," while the simple
coefficients $r_{12}$, etc., are called "zero-order" correlation
coefficients. If there are more than three variables involved so
that a partial coefficient of correlation, $r_{12.34}$, for example, is
found, it is called a "second-order" coefficient of correlation;
similarly, $r_{12.345}$ is a "third-order" coefficient of correlation, etc.
This classification is helpful in distinguishing different sets of
correlation coefficients. The same terminology may be conveniently
carried over to the other statistics in a correlation
problem. Thus, $\sigma_1$ is a zero-order standard deviation, $\sigma_{1.2}$ is a
first-order standard deviation, etc.; $b_{12.3}$ is a first-order regression
statistic, $b_{12.34}$ is a second-order regression statistic; etc.

ANALYSIS OF VARIANCE IN MULTIPLE CORRELATION 

When a plane of regression, for example, $X_1$ on $X_2$ and $X_3$,
is fitted to a trivariate frequency distribution by the method of
least squares, variation in $X_1$ may be viewed as made up of a
part that is due to its linear association with $X_2$, a second part
that is due to its linear association with $X_3$, and a third part
that is due to association with factors independent of both $X_2$
and $X_3$. For the least-squares Equations (6) show that
$\Sigma x_2d = 0$ and $\Sigma x_3d = 0$, which means that neither $X_2$ nor $X_3$
is linearly correlated with deviations from the plane of regression.
In the case of a normal trivariate frequency distribution,
the independent variables are not correlated in any way with
the deviations from the plane of regression. Owing to the lack
of correlation, the variance in the dependent variable is equal to
the variance of the values given by the plane of regression plus
the variance of the deviations from the plane. This may be
shown as follows:


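$$x_1 = x_1' + (x_1 - x_1')$$

and

$$x_1^2 = (x_1')^2 + 2(x_1 - x_1')x_1' + (x_1 - x_1')^2 \tag{28}$$

If all the deviations squared, like $x_1^2$ described in Eq. (28), are
added, the following result is obtained:

$$\Sigma x_1^2 = \Sigma(x_1')^2 + \Sigma(x_1 - x_1')^2 + 2\Sigma(x_1 - x_1')x_1'$$

Substituting $\left(\beta_{12.3}\frac{x_2}{\sigma_2} + \beta_{13.2}\frac{x_3}{\sigma_3}\right)\sigma_1$ for the last $x_1'$ gives

$$\Sigma x_1^2 = \Sigma(x_1')^2 + \Sigma(x_1 - x_1')^2 + 2\sigma_1\beta_{12.3}\frac{\Sigma(x_1 - x_1')x_2}{\sigma_2} + 2\sigma_1\beta_{13.2}\frac{\Sigma(x_1 - x_1')x_3}{\sigma_3}$$

But, as just stated, the deviations $x_1 - x_1'$ are not linearly correlated
with $x_2$ and $x_3$, so that the cross-product terms are zero.
Therefore,

$$\Sigma x_1^2 = \Sigma(x_1')^2 + \Sigma(x_1 - x_1')^2$$

If each term is divided by $N$, it is found that

$$\sigma_1^2 = \sigma_{x_1'}^2 + \sigma_{1.23}^2 \tag{29}$$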

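In Eq. (29), $\sigma_{x_1'}^2$ may be further evaluated, as follows:¹

$$\Sigma(x_1')^2 = \sigma_1^2\left(\beta_{12.3}^2\frac{\Sigma x_2^2}{\sigma_2^2} + \beta_{13.2}^2\frac{\Sigma x_3^2}{\sigma_3^2} + 2\beta_{12.3}\beta_{13.2}\frac{\Sigma x_2x_3}{\sigma_2\sigma_3}\right)$$

¹ Since $x_1' = \sigma_1\left(\beta_{12.3}\frac{x_2}{\sigma_2} + \beta_{13.2}\frac{x_3}{\sigma_3}\right)$. See p. 411.

If the above is divided by $N$ and $\sigma_2^2$, $\sigma_3^2$, and $r_{23}$ are substituted
for equivalent values, the expression becomes

$$\sigma_{x_1'}^2 = \sigma_1^2(\beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2}) \tag{30}$$

By substituting this value for $\sigma_{x_1'}^2$ in Eq. (29), it is found that

$$\sigma_1^2 = \sigma_1^2(\beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2}) + \sigma_{1.23}^2$$

or

$$1 = \beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2} + \frac{\sigma_{1.23}^2}{\sigma_1^2} \tag{31}$$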

From the manner in which Eq. (31) was derived, it is known
that the terms on the right side each represent a percentage of
the total variance of $X_1$. This may be interpreted in the following
manner:

Fig. 118. Illustration of coefficients of direct determination, meaning of the $2r$-$\beta$
cross-product term, and the residual variance (factors other than $X_2$ and $X_3$).

Coefficients of Direct Determination. The first term of the
right side of Eq. (31), $\beta_{12.3}^2$, may be interpreted as the percentage
of the total variation in $X_1$ that is due to its direct association
with $X_2$. It has consequently been called the "coefficient of
direct determination" of $X_1$ by $X_2$. Similarly, $\beta_{13.2}^2$ may be
interpreted as the percentage of the total variation in $X_1$ that
is due to its direct association with $X_3$. Figure 118 depicts the
coefficients of direct determination by arrows pointing from $X_2$
directly to $X_1$ and from $X_3$ directly to $X_1$.

Coefficients of Net Regression. The beta unsquared, $\beta_{12.3}$,
describes the change in $X_1$ in standard-deviation units that
accompanies a given change of $X_2$ in standard-deviation units,
when $X_3$ is constant. Geometrically, $\beta_{12.3}$ is the slope of the line
of intersection of a plane perpendicular to the $x_3/\sigma_3$ axis with the
plane of regression,

$$\frac{x_1'}{\sigma_1} = \beta_{12.3}\frac{x_2}{\sigma_2} + \beta_{13.2}\frac{x_3}{\sigma_3}$$

The statistic $\beta_{12.3}$ has been called the "coefficient of net regression"
of $X_1$ on $X_2$ in standard-deviation units. The coefficient
of net regression in original units is $b_{12.3}$.

Coefficient of Joint Determination. The term $2r_{23}\beta_{12.3}\beta_{13.2}$ may
be taken as representing the percentage of the total variation in
$X_1$ that is due to the joint or combined effect of $X_2$ and $X_3$
resulting from the correlation between these two variables. In
Fig. 118 the influence of $X_3$ on $X_1$ through its correlation with $X_2$
is depicted by the line $aa'$; the influence of $X_2$ on $X_1$ through its
correlation with $X_3$ is depicted by the line $bb'$. Relationships
along these lines indicate the significance of the $r$-$\beta$ cross-product
term of Eq. (31). While variation in $X_2$ may directly affect
$X_1$, it may also, through its correlation with $X_3$, bring about a
change in $X_3$ and hence cause further variation in $X_1$ resulting
from the connection between $X_3$ and $X_1$. Similarly, variation
in $X_3$ may affect $X_1$, not only directly, but also indirectly
through the association of $X_3$ and $X_2$. The term $2r_{23}\beta_{12.3}\beta_{13.2}$
may be taken to represent the combined indirect variation in
$X_1$ resulting from variations in $X_2$ and $X_3$.

Meaning of Residual Variance. The portion of variance in
$X_1$ due directly to $X_2$ is $\beta_{12.3}^2$; the portion due directly to $X_3$ is
$\beta_{13.2}^2$; the portion due to the joint influence of $X_2$ and $X_3$ is the
$2r$-$\beta$ cross-product term; the remainder of the variance is due
to other factors not linearly correlated with $X_2$ and $X_3$. This
is depicted in Fig. 118. The sum of all of these four terms is
equal to the total variance, or, expressed as a proportion, to 1.
The sum of the first three terms may be interpreted as the total
portion of variance in $X_1$ that is due to its association with
$X_2$ and $X_3$ jointly; the final term $\sigma_{1.23}^2/\sigma_1^2$ represents the portion
of the variance in $X_1$ that is due to its association with other
factors not linearly correlated with $X_2$ and $X_3$. The sum of all
these portions of the total variance of $X_1$ is necessarily equal to 1.
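The four portions named in Eq. (31) can be tabulated for any set of three correlations. The sketch below is illustrative only: the function name `variance_shares` and the sample correlations are hypothetical, and the betas are computed from the usual three-variable forms assumed to correspond to Eq. (9). The residual share is obtained separately from Eq. (20), so that the total of 1 serves as a real cross check rather than being forced.

```python
def variance_shares(r12, r13, r23):
    """Split the variance of X1 into the four proportions of Eq. (31)."""
    denom = 1 - r23 ** 2
    beta12_3 = (r12 - r13 * r23) / denom   # assumed form of Eq. (9)
    beta13_2 = (r13 - r12 * r23) / denom
    direct_x2 = beta12_3 ** 2              # coefficient of direct determination by X2
    direct_x3 = beta13_2 ** 2              # coefficient of direct determination by X3
    joint = 2 * r23 * beta12_3 * beta13_2  # coefficient of joint determination
    # Residual share sigma_1.23^2 / sigma_1^2, computed as 1 - R^2 by Eq. (20).
    residual = 1 - (beta12_3 * r12 + beta13_2 * r13)
    return direct_x2, direct_x3, joint, residual


shares = variance_shares(0.6, 0.5, 0.4)
total = sum(shares)
print(shares, total)
```

For these sample correlations the direct shares are roughly 0.23 and 0.10, the joint share about 0.12, and the residual about 0.56.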
The Coefficient of Multiple Correlation. From the previous
discussion regarding the interpretation of the simple correlation
coefficient and the correlation ratio, it is to be expected that a
similar interpretation might be made of the multiple-correlation
coefficient, that is, that $R_{1.23}^2$ represents the portion of the total
variance in $X_1$ which is due to its joint association with $X_2$ and
$X_3$. Equation (31) shows that this is actually the case; for it
will be recalled that

$$R_{1.23}^2 = 1 - \frac{\sigma_{1.23}^2}{\sigma_1^2}$$

Therefore, it follows by Eqs. (29), (30), and (31) that

$$R_{1.23}^2 = \beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2} = \frac{\sigma_{x_1'}^2}{\sigma_1^2} \tag{32}$$

This interpretation of $R_{1.23}^2$ is similar to that previously made of

$$r_{12}^2 = \frac{\sigma_{x_1'}^2}{\sigma_1^2} \quad\text{and}\quad \eta_{12}^2 = \frac{\sigma_{x_1'}^2}{\sigma_1^2}$$

where $\sigma_{x_1'}$ is the standard deviation of the line of regression.

It has been noted that $R_{1.23}^2$ can be interpreted as the portion
of the total variance of $X_1$ that may be attributed to its joint
association with $X_2$ and $X_3$. It has also been shown that
$R_{1.23}^2 = \beta_{12.3}r_{12} + \beta_{13.2}r_{13}$ [see Eq. (20)]. Consequently, it is
possible to view $\beta_{12.3}r_{12}$ as the portion of the variance of $X_1$
that is due to its total association, both direct and indirect, with
$X_2$ and also to view $\beta_{13.2}r_{13}$ as the portion of the variance of $X_1$
that is due to its total association, both direct and indirect,
with $X_3$. Therefore, these two products have been called "coefficients
of total determination" of $X_1$ by $X_2$ and $X_3$.

When either of the products is negative, however,¹ it is preferable
to resolve the expression into its equal, namely,

$$\beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2}$$

because

$$R_{1.23}^2 = \beta_{12.3}r_{12} + \beta_{13.2}r_{13} = \beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2}$$

¹ The interpretation of the products as coefficients of total determination
runs into difficulties in any particular case if either of them is negative.
Either $\beta_{12.3}r_{12}$ or $\beta_{13.2}r_{13}$, but not both, may be negative, since their sum
equals $R_{1.23}^2$, which, of course, is not negative. To say that a variable
contributes a negative percentage to the total variance of $X_1$ has no meaning,
and consequently when either term is negative the interpretation is
imaginary.

Expressed in the $\beta^2$ and cross-product form, it is easier to understand
why a negative value of $\beta_{12.3}r_{12}$ or of $\beta_{13.2}r_{13}$ (but not of
both) may occur; for whenever either $\beta_{12.3}r_{12}$ or $\beta_{13.2}r_{13}$ is negative,
it will be found that the joint contribution of $X_2$ and $X_3$
to the variance of $X_1$, represented by $2r_{23}\beta_{12.3}\beta_{13.2}$, is also negative.
This follows because either $r_{23}$ is negative or $\beta_{12.3}$ and
$\beta_{13.2}$ are of opposite signs. In such a case, the direct effect of
$X_2$, for example, on the variation of $X_1$ would be opposite in
sign to its indirect effect, that is, through its correlation with
$X_3$; and the existence of this indirect link to $X_1$ through $X_3$
would tend to diminish the total variation caused in $X_1$ by
changes in $X_2$. This dampening effect of a negative value for
the $r$-$\beta$ cross-product term is what explains the existence of a
negative value for either $\beta_{12.3}r_{12}$ or $\beta_{13.2}r_{13}$. The form where they
are squared also shows the difficulty in trying to assign to the
variables $X_2$ and $X_3$ an independent part in accounting for the
variation in $X_1$. When there is a joint contribution by the two
variables, it becomes misleading to attempt to break it up and
assign part of it to one and part to the other, as the foregoing
interpretation based upon the form $\beta_{12.3}r_{12} + \beta_{13.2}r_{13}$ appears to
do.
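A small numerical case shows how one of the products can turn negative. The figures below are hypothetical and chosen only for illustration; the betas again use the three-variable forms assumed to correspond to Eq. (9).

```python
r12, r13, r23 = 0.8, 0.3, 0.6   # hypothetical zero-order correlations

denom = 1 - r23 ** 2
beta12_3 = (r12 - r13 * r23) / denom   # positive
beta13_2 = (r13 - r12 * r23) / denom   # negative, although r13 itself is positive

total_x2 = beta12_3 * r12              # "total determination" by X2
total_x3 = beta13_2 * r13              # "total determination" by X3: negative here
joint = 2 * r23 * beta12_3 * beta13_2  # joint term: also negative

r_squared_products = total_x2 + total_x3
r_squared_squares = beta12_3 ** 2 + beta13_2 ** 2 + joint

print(total_x2, total_x3, joint, r_squared_products, r_squared_squares)
```

Here $X_3$ is positively correlated with $X_1$, yet its product $\beta_{13.2}r_{13}$ is negative and the joint term is negative as well, while both forms give the same $R_{1.23}^2$. This is the case in which the interpretation of the products as separate percentage contributions breaks down.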

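As noted above, the part of the variance, $\sigma_1^2$, that is due to
correlation between $X_1$ and $X_2$ and $X_3$ together is described as
follows:¹

$$\sigma_{x_1'}^2 = R_{1.23}^2\,\sigma_1^2$$

The right side of this expression describes the variance of the
plane of regression, just as $r_{12}^2\sigma_1^2$ describes the variance of the
line of regression. This part of the variance in the $X_1$ variable is
made up of two parts that may be analyzed as follows:²

$$\begin{aligned}
\sigma_{1.23}^2 &= \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2) \\
&= \sigma_1^2 - r_{13.2}^2\sigma_1^2 - r_{12}^2\sigma_1^2 + r_{12}^2r_{13.2}^2\sigma_1^2 \\
&= \sigma_1^2 - r_{12}^2\sigma_1^2 - r_{13.2}^2\sigma_1^2(1 - r_{12}^2) \\
&= \sigma_1^2 - r_{12}^2\sigma_1^2 - \sigma_{1.2}^2r_{13.2}^2
\end{aligned}$$

¹ Cf. Eqs. (29) and (31).

² See footnote, p. 416.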
Therefore, by transposing terms,

$$\sigma_1^2 = r_{12}^2\sigma_1^2 + r_{13.2}^2\sigma_{1.2}^2 + \sigma_{1.23}^2 \tag{33}$$

It follows from Eqs. (33) and (29) that

$$\sigma_{x_1'}^2 = r_{12}^2\sigma_1^2 + \sigma_{1.2}^2r_{13.2}^2$$

in which $\sigma_{1.2}$ is the scatter about the line of regression of $X_1$ on $X_2$.

It is thus seen that the variance of the plane of regression
consists of two parts: (1) the variance due to the total linear association
between $X_1$ and $X_2$, which is $r_{12}^2\sigma_1^2$; and (2) of the remaining
variance about the line of regression ($\sigma_{1.2}^2$), the portion that
is due to the influence of $X_3$ not already included in $r_{12}$, namely,
$r_{13.2}^2\sigma_{1.2}^2$. The partial coefficient of correlation $r_{13.2}$ describes the
relationship between $X_1$ and $X_3$ when $X_2$ is held constant; in
other words, it is the net correlation between $X_1$ and $X_3$. The
square of this partial coefficient of correlation is therefore the
coefficient of proportional variance in $X_1$ due to net correlation
with $X_3$. If to variance due to total association with $X_2$ is
added the variance due to net correlation with $X_3$, the result is
the variance due to total correlation with $X_2$ and $X_3$ jointly.
The remainder of the variance, as Eq. (33) indicates, is due to
other causes not linearly correlated with $X_2$ and $X_3$.
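The two-stage breakdown of Eq. (33) can be verified on sample figures. In the sketch below the correlations and the standard deviation of $X_1$ are hypothetical, and the comparison value of $R_{1.23}^2$ is computed independently from Eq. (20), using the three-variable betas assumed to correspond to Eq. (9).

```python
import math

r12, r13, r23 = 0.6, 0.5, 0.4   # hypothetical zero-order correlations
sigma1 = 10.0                   # hypothetical standard deviation of X1

# Net correlation of X1 with X3 when X2 is held constant, Eq. (23').
r13_2 = (r13 - r12 * r23) / math.sqrt((1 - r12 ** 2) * (1 - r23 ** 2))

var_total = sigma1 ** 2
var_due_x2 = r12 ** 2 * var_total              # total association with X2
var_about_line = var_total * (1 - r12 ** 2)    # sigma_1.2 squared
var_net_x3 = r13_2 ** 2 * var_about_line       # net correlation with X3
var_residual = var_about_line * (1 - r13_2 ** 2)   # sigma_1.23 squared

# Independent route to the variance of the plane: R^2 * sigma_1^2 by Eq. (20).
denom = 1 - r23 ** 2
r_squared = ((r12 - r13 * r23) / denom) * r12 + ((r13 - r12 * r23) / denom) * r13
var_plane = r_squared * var_total

print(var_due_x2, var_net_x3, var_residual, var_plane)
```

The first two parts add up to the variance of the plane of regression, and the three parts together restore the total variance of $X_1$.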

Analysis of Variance and Causal Relationships. Where other 
knowledge suggests that causal relationships run only in one
direction, the preceding analysis takes on considerable signifi- 
cance. In biological investigations, for example, where the 
effect of heredity is being studied, it seems logical to assume 
that variations in parents cause variations in offspring and that 
the causal relationship does not run the other way. Again, in 
certain economic problems, it is to be assumed that fluctuations 
in weather conditions bring about changes in economic con- 
ditions, but that the latter have no effect upon the former. In 
one-directional setups of this kind, the $\beta^2$'s take on the full significance
of the connotation "coefficients of determination." The
$\beta^2$'s measure the amount of variation in the dependent variable
caused by fluctuations in each of the independent variables
separately, and in conjunction with the other variables in the
$2r$-$\beta$ cross-product expression. It is in this type of problem that
correlation analysis attains its greatest significance. 
Where there is no reason to believe that causal relationships
are unilateral, the interpretation of the results of correlation 
analysis in terms of causal determination becomes unscientific. 
In most problems there is interaction between the variables 
rather than a strictly one-directional association. It is still 
possible to estimate what is called the dependent variable from 
knowledge of so-called independent variables, but the latter 
must not be looked upon as determining the former. With 
reference to a regression equation used for purposes of estimation 
and prediction, the $\beta^2$'s and the $2r$-$\beta$ cross-product terms are
useful in showing how important certain factors are, separately
and in conjunction with other factors, in making estimates or 
predictions of the dependent variable. They become coefficients 
of determination only in the sense of statistical determination or 
estimation and not in any sense of physical, biological, or eco- 
nomic causation. 

Of incidental importance to method and theory but of con- 
siderable importance to practice, Eq. (31) provides a cross check 
on the arithmetical work for finding most of the statistics in the 
multiple-correlation problem. 

EXTENSION OF MULTIPLE- AND PARTIAL-CORRELATION 
ANALYSIS TO FOUR VARIABLES 

The foregoing analysis, which has pertained to only three 
variables, may be extended to cover a greater number of vari- 
ables. In this section its extension to a four-variable problem 
will be discussed. In the next section its extension to any desired 
number of variables will be considered. 

When there are more than two independent variables in a
regression equation, the number of $\beta$ coefficients to be determined
is correspondingly increased. If there were four variables,
for example, that is, three independent variables, the form
of the regression equation would become

$$\frac{x_1'}{\sigma_1} = \beta_{12.34}\frac{x_2}{\sigma_2} + \beta_{13.24}\frac{x_3}{\sigma_3} + \beta_{14.23}\frac{x_4}{\sigma_4} \tag{34}$$

If this regression plane is fitted by the method of least squares,
the $\beta$'s can be determined in terms of the $r$'s in the manner
described for the three-variable problem; but, there being three
equations to be solved for three $\beta$'s, the solutions are not so simple
as those given by Eqs. (8) and (9). Some special method of
solution may be employed, such as the so-called "Doolittle
method" of substitution or some determinant method.¹

The chief advantage claimed for the Doolittle method is that
it provides a check at each step in the problem; but experience 
for several years with its use for four-variable correlation prob- 
lems has demonstrated its complexity and sensitivity. For a 
multivariate problem of more than four variables both the 
Doolittle work sheet and the determinant methods become 
increasingly cumbersome. 

A saving in computation is obtained by using formulas based 
upon the algebraic evaluation of correlation statistics. This 
saving was demonstrated for the trivariate case in the first part 
of this chapter where it was found that the $\beta$'s could be evaluated
in terms of the zero-order $r$'s [see Eq. (9)]. In addition, however,
it is possible by symmetry to extend the algebraic results of
some of the formulas to apply to a multivariate correlation of
as many variables as desired, in effect to evaluate algebraically
the correlation statistics for the general regression equation

$$X_i' = a_{i.jk\ldots n} + b_{ij.k\ldots n}X_j + b_{ik.j\ldots n}X_k + \cdots + b_{in.jk\ldots}X_n \tag{35}$$


First the correlation statistics will be algebraically evaluated 
for the four-variable case, which is done as follows: 

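The normal least-squares equations for fitting a plane of
regression of $X_1$ on $X_2$, $X_3$, and $X_4$, when divided by $N$, are
as follows:²

$$\begin{aligned}
\frac{\Sigma x_1x_2}{N\sigma_1\sigma_2} &= \beta_{12.34}\frac{\Sigma x_2^2}{N\sigma_2^2} + \beta_{13.24}\frac{\Sigma x_2x_3}{N\sigma_2\sigma_3} + \beta_{14.23}\frac{\Sigma x_2x_4}{N\sigma_2\sigma_4} \\
\frac{\Sigma x_1x_3}{N\sigma_1\sigma_3} &= \beta_{12.34}\frac{\Sigma x_2x_3}{N\sigma_2\sigma_3} + \beta_{13.24}\frac{\Sigma x_3^2}{N\sigma_3^2} + \beta_{14.23}\frac{\Sigma x_3x_4}{N\sigma_3\sigma_4} \\
\frac{\Sigma x_1x_4}{N\sigma_1\sigma_4} &= \beta_{12.34}\frac{\Sigma x_2x_4}{N\sigma_2\sigma_4} + \beta_{13.24}\frac{\Sigma x_3x_4}{N\sigma_3\sigma_4} + \beta_{14.23}\frac{\Sigma x_4^2}{N\sigma_4^2}
\end{aligned}$$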

These may be written, by substituting for their equivalent
values, $r_{12}$, $\sigma_2^2$, $r_{23}$, $r_{24}$, $r_{13}$, $\sigma_3^2$, $r_{34}$, $r_{14}$, $\sigma_4^2$, as follows:

$$\begin{aligned}
r_{12} &= \beta_{12.34} + r_{23}\beta_{13.24} + r_{24}\beta_{14.23} \\
r_{13} &= r_{23}\beta_{12.34} + \beta_{13.24} + r_{34}\beta_{14.23} \\
r_{14} &= r_{24}\beta_{12.34} + r_{34}\beta_{13.24} + \beta_{14.23}
\end{aligned} \tag{36}$$

¹ In theoretical work the determinant method gives neat and easily
remembered solutions. In practical work, however, many statisticians
prefer the method of substitution.

² See pp. 411-412.


If the first of Eqs. (36) is multiplied by $r_{23}$ and the result is
subtracted from the second, $\beta_{12.34}$ is eliminated and it is found
that

$$r_{13} - r_{12}r_{23} = (1 - r_{23}^2)\beta_{13.24} + (r_{34} - r_{23}r_{24})\beta_{14.23}$$

or

$$\frac{r_{13} - r_{12}r_{23}}{1 - r_{23}^2} = \beta_{13.24} + \frac{r_{34} - r_{23}r_{24}}{1 - r_{23}^2}\,\beta_{14.23}$$

which, by Eqs. (9), is equivalent to saying

$$\beta_{13.2} = \beta_{13.24} + \beta_{43.2}\beta_{14.23} \tag{37}$$

Similarly, if the first of Eqs. (36), multiplied by $r_{24}$, is subtracted
from the third, $\beta_{12.34}$ is eliminated and it is found that

$$\beta_{14.2} = \beta_{14.23} + \beta_{34.2}\beta_{13.24} \tag{38}$$

Correlation Statistics in Terms of Lower-order Correlation Statistics
of the Same Kind. From Eqs. (37) and (38), solved simultaneously,
it is found that

$$\beta_{14.23} = \frac{\beta_{14.2} - \beta_{34.2}\beta_{13.2}}{1 - \beta_{34.2}\beta_{43.2}} \tag{39}$$

All the other second-order $\beta$'s may be evaluated in a similar
algebraic manner, or written by symmetry; for example, by
symmetry with Eq. (39), the $\beta_{13.24}$ is as follows:

$$\beta_{13.24} = \frac{\beta_{13.2} - \beta_{43.2}\beta_{14.2}}{1 - \beta_{43.2}\beta_{34.2}} \tag{40}$$

Equations (39) and (40) are equivalent to the following expressions
in terms of the $b$'s, because, if $b_{14.2}\frac{\sigma_4}{\sigma_1}$, $b_{34.2}\frac{\sigma_4}{\sigma_3}$, and similar
values of $\beta$'s in terms of corresponding $b$'s and standard-deviation
ratios are substituted, the standard deviations cancel out.¹

¹ Since the order of digits in the subscript after the point is immaterial
in both the $b$ and $\beta$ statistics, the following formulas may be used as checks
$$b_{14.23} = \frac{b_{14.2} - b_{34.2}b_{13.2}}{1 - b_{34.2}b_{43.2}} \tag{39'}$$

$$b_{13.24} = \frac{b_{13.2} - b_{43.2}b_{14.2}}{1 - b_{43.2}b_{34.2}} \tag{40'}$$

Correlation Statistics in Terms of Lower-order $r$'s and $\sigma$'s. It is
possible to express the above formulas in terms of lower-order
$r$'s and $\sigma$'s if this method of calculation is preferred. It was
noted in Eq. (26) that

$$r_{12.3} = \beta_{12.3}\,\frac{\sigma_1\sigma_{2.3}}{\sigma_2\sigma_{1.3}}$$

and, by symmetry, it follows that

$$r_{14.2} = \beta_{14.2}\,\frac{\sigma_1\sigma_{4.2}}{\sigma_4\sigma_{1.2}}$$

Also, by Eq. (27),

$$r_{14.2} = b_{14.2}\,\frac{\sigma_{4.2}}{\sigma_{1.2}}$$

Consequently,

$$\beta_{14.2} = r_{14.2}\,\frac{\sigma_4\sigma_{1.2}}{\sigma_1\sigma_{4.2}}$$

$$b_{14.2} = r_{14.2}\,\frac{\sigma_{1.2}}{\sigma_{4.2}}$$

If these values of the $\beta$'s in terms of $r$'s and standard deviations
are substituted in Eq. (39) and the values of the $b$'s in terms of
the $r$'s and the scatter are substituted in Eq. (39'), the following
results are obtained:

$$\beta_{12.34} = \frac{r_{12.3} - r_{42.3}r_{14.3}}{1 - r_{24.3}^2}\cdot\frac{\sigma_2\sigma_{1.3}}{\sigma_1\sigma_{2.3}} \tag{41}$$

$$b_{12.34} = \frac{r_{12.3} - r_{24.3}r_{14.3}}{1 - r_{24.3}^2}\cdot\frac{\sigma_{1.3}}{\sigma_{2.3}} \tag{41'}$$

Correlation Statistics in Terms of Correlation Statistics of Same 
Order. The preceding formulas may be transformed to still 


on the ones given in the text:

$$\beta_{14.32} = \frac{\beta_{14.3} - \beta_{24.3}\beta_{12.3}}{1 - \beta_{24.3}\beta_{42.3}}$$

$$b_{14.32} = \frac{b_{14.3} - b_{24.3}b_{12.3}}{1 - b_{24.3}b_{42.3}}$$
another form in which the correlation statistics are expressed in
terms of other correlation statistics of the same order. Thus,
since $\sigma_{1.34} = \sigma_{1.3}\sqrt{1 - r_{14.3}^2}$ and $\sigma_{2.34} = \sigma_{2.3}\sqrt{1 - r_{24.3}^2}$, it follows
that

$$\sigma_{1.3} = \frac{\sigma_{1.34}}{\sqrt{1 - r_{14.3}^2}}$$

$$\sigma_{2.3} = \frac{\sigma_{2.34}}{\sqrt{1 - r_{24.3}^2}}$$

If these values are substituted in Eqs. (41) and (41'), the following
results are obtained:

$$\beta_{12.34} = r_{12.34}\,\frac{\sigma_2\sigma_{1.34}}{\sigma_1\sigma_{2.34}} \tag{42}$$

$$b_{12.34} = r_{12.34}\,\frac{\sigma_{1.34}}{\sigma_{2.34}} \tag{42'}$$

Similar algebraic procedure will show that all the other $\beta$'s
and $b$'s have formulas that are symmetrical with the above.
For example,

$$\beta_{13.24} = r_{13.24}\,\frac{\sigma_3\sigma_{1.24}}{\sigma_1\sigma_{3.24}}$$

$$b_{13.24} = r_{13.24}\,\frac{\sigma_{1.24}}{\sigma_{3.24}}$$

$$\beta_{14.23} = r_{14.23}\,\frac{\sigma_4\sigma_{1.23}}{\sigma_1\sigma_{4.23}}$$

$$b_{14.23} = r_{14.23}\,\frac{\sigma_{1.23}}{\sigma_{4.23}}$$


Partial Correlation in the Four-variable Case. If there are
four variables $X_1$, $X_2$, $X_3$, and $X_4$, the partial-correlation coefficient
between $X_1$ and $X_2$ when $X_3$ and $X_4$ are held constant,
that is, $r_{12.34}$, is defined as the simple correlation coefficient
between the deviations of $X_1$ from the plane of regression of $X_1$
on $X_3$ and $X_4$ and the deviations of $X_2$ from the plane of regression
of $X_2$ on $X_3$ and $X_4$. Accordingly, partial correlation, when
four variables are involved, is no more complex algebraically
than when three variables are involved; both are simple linear
correlations between residuals. In the three-variable problem
the partial coefficient of correlation is found by correlating the
residuals about lines of regression; in the four-variable problem
the partial coefficient of correlation is found by correlating the
residuals about planes of regression. The formula for a
second-order partial coefficient of correlation is obtained
in the same algebraic manner as for the first-order partial
coefficients of correlation [see Eqs. (22) and (23)]. For $r_{13.24}$, the
formula is

$$r_{13.24} = \frac{r_{13.2} - r_{14.2}r_{34.2}}{\sqrt{1 - r_{14.2}^2}\,\sqrt{1 - r_{34.2}^2}} \tag{43}$$

This is a partial coefficient of correlation of the second order. 
It will be noted that the formula runs in terms of the correlation 
coefficients of the first order. To make use of this formula in 
practice, therefore, it first becomes necessary to calculate the 
zero-order correlation coefficients and then the first-order coeffi- 
cients before the second-order coefficients can be determined. 
The calculation of higher-order partial-correlation coefficients is 
thus similar to the calculation of those of lower order, but they 
require additional work. 

Third-order Variance and Multiple-correlation Coefficient. 

The formula for third-order variance in the four-variable prob- 
lem is obtained by adding a term to the formula for scatter in 
the three-variable problem, as follows: 

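$$\sigma_{1.234}^2 = \sigma_1^2\,(1 - \beta_{12.34}\,r_{12} - \beta_{13.24}\,r_{13} - \beta_{14.23}\,r_{14}) \qquad (44)$$

The equation for the multiple-correlation coefficient is, by symmetry, as follows:

$$R_{1.234}^2 = 1 - \frac{\sigma_{1.234}^2}{\sigma_1^2} \qquad (45)$$

or

$$R_{1.234}^2 = \beta_{12.34}\,r_{12} + \beta_{13.24}\,r_{13} + \beta_{14.23}\,r_{14}$$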
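The second form of the multiple-correlation formula is a plain sum of products, one for each independent variable. The following is a minimal sketch of that sum, not the authors' own procedure; the function name `multiple_r_squared` is hypothetical, and the numbers are the first-order values from the three-variable illustration in Chap. XVII ($\beta_{12.3}$, $\beta_{13.2}$, $r_{12}$, $r_{13}$).

```python
import math


def multiple_r_squared(betas, zero_order_rs):
    """Return R^2 as the sum of beta * r, one product per independent variable."""
    if len(betas) != len(zero_order_rs):
        raise ValueError("need one zero-order r for each beta")
    return sum(beta * r for beta, r in zip(betas, zero_order_rs))


# Three-variable case: beta_12.3 and beta_13.2 paired with r_12 and r_13.
r2_1_23 = multiple_r_squared([0.80673, 0.14997], [0.89576, 0.62894])
R_1_23 = math.sqrt(r2_1_23)
print(round(r2_1_23, 5), round(R_1_23, 4))
```

With four variables the same function would take three $\beta$'s and three zero-order $r$'s, following Eq. (45).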


EXTENSION OF MULTIPLE- AND PARTIAL-CORRELATION 
ANALYSIS TO ANY DESIRED NUMBER OF VARIABLES 

For a three-variable and even for a four-variable correlation problem, it is probably desirable first to calculate the $\beta$'s from the lower-order $\beta$'s. In the three-variable correlation problem, this means calculating the $\beta$'s from the zero-order $r$'s, which at that level correspond to the $\beta$'s.

For more than four variables it may be better first to calculate the higher-order $r$'s from the lower-order $r$'s and thereafter obtain the $b$ and $\beta$ statistics. Following is the extension to the general multivariate problem of the various formulas, showing the series of formulas in some cases from the simple correlation to the multivariate in order to depict how they are readily obtained by symmetry.

Extension of Formulas to General Multivariate Problem. 

Statistics for the Regression Planes. The $b$ and $\beta$ statistics are evaluated, in general, by the following formula:

(a) $$b = r\,\frac{\sigma}{\sigma}$$

or

$$b_{ij.kt \ldots n} = r_{ij.kt \ldots n}\,\frac{\sigma_{i.kt \ldots n}}{\sigma_{j.kt \ldots n}}$$

This formula can be used when the $r$'s and $\sigma$'s of the same order have already been obtained, but it does not provide for the calculation of the $b$'s from lower-order statistics.

In the simple linear-regression equation, carrying out the instructions of the above formula gives $b_{12} = r_{12}\,\sigma_1/\sigma_2$. If there are three variables, it becomes $b_{12.3} = r_{12.3}\,\sigma_{1.3}/\sigma_{2.3}$, etc.

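The relationship between the $b$'s and the $\beta$'s is not affected by the number of variables and may be summarized as follows:

(b) $$\beta_{12.3} = b_{12.3}\,\frac{\sigma_2}{\sigma_1} \qquad \beta_{12.34} = b_{12.34}\,\frac{\sigma_2}{\sigma_1}$$

$$\beta_{13.2} = b_{13.2}\,\frac{\sigma_3}{\sigma_1} \qquad \beta_{13.24} = b_{13.24}\,\frac{\sigma_3}{\sigma_1}$$

$$\beta_{ij.kt \ldots n} = b_{ij.kt \ldots n}\,\frac{\sigma_j}{\sigma_i}$$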


The $a$'s for the various planes of regression are found by formulas such as the following:

(c) $$a_{1.234} = \bar{X}_1 - b_{12.34}\bar{X}_2 - b_{13.24}\bar{X}_3 - b_{14.23}\bar{X}_4$$

$$a_{2.134} = \bar{X}_2 - b_{21.34}\bar{X}_1 - b_{23.14}\bar{X}_3 - b_{24.13}\bar{X}_4$$

$$a_{i.jk \ldots n} = \bar{X}_i - b_{ij.k \ldots n}\bar{X}_j - b_{ik.j \ldots n}\bar{X}_k - \cdots - b_{in.jk \ldots}\bar{X}_n$$

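High-order Variances. The higher-order variances may be estimated by using either the $\beta$ formulas or the partial $r$ formulas:

(d) $$\sigma_{1.234}^2 = \sigma_1^2\,(1 - \beta_{12.34}\,r_{12} - \beta_{13.24}\,r_{13} - \beta_{14.23}\,r_{14})$$

$$\sigma_{i.jk \ldots n}^2 = \sigma_i^2\,(1 - \beta_{ij.k \ldots n}\,r_{ij} - \beta_{ik.j \ldots n}\,r_{ik} - \cdots - \beta_{in.jk \ldots}\,r_{in})$$

$$\sigma_{1.2}^2 = \sigma_1^2\,(1 - r_{12}^2)$$

$$\sigma_{1.23}^2 = \sigma_1^2\,(1 - r_{12}^2)(1 - r_{13.2}^2)$$

$$\sigma_{1.234}^2 = \sigma_1^2\,(1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2)$$

$$\sigma_{i.jk \ldots n}^2 = \sigma_i^2\,(1 - r_{ij}^2)(1 - r_{ik.j}^2) \cdots (1 - r_{in.jk \ldots}^2)$$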

The Multiple-correlation Formulas. As in the case of the relationship between the $b$'s and the $\beta$'s, the formula for the multiple-correlation coefficient is of the same form regardless of the number of variables; it is always a comparison of the residual scatter about a plane of regression with the total standard deviation. In the four-variable and the general multiple-variable cases it is as follows:

(e) $$R_{1.234}^2 = 1 - \frac{\sigma_{1.234}^2}{\sigma_1^2}$$

$$R_{i.jkt \ldots n}^2 = 1 - \frac{\sigma_{i.jkt \ldots n}^2}{\sigma_i^2}$$

The Partial-correlation Formulas. The forms of the formulas for the partial coefficients of correlation are also independent of the number of variables in the problem, because the partial correlation coefficient always pertains to a simple linear correlation. The formulas are therefore symmetrical, as follows:

Second-order partials:

$$r_{12.34} = \frac{r_{12.3} - r_{14.3}\,r_{24.3}}{\sqrt{1 - r_{14.3}^2}\,\sqrt{1 - r_{24.3}^2}}$$

And in general, the formula for calculating partials of a given order from the next lower order partials:

$$r_{ij.k \ldots n} = \frac{r_{ij.k \ldots (n-1)} - r_{in.k \ldots (n-1)}\,r_{jn.k \ldots (n-1)}}{\sqrt{1 - r_{in.k \ldots (n-1)}^2}\,\sqrt{1 - r_{jn.k \ldots (n-1)}^2}}$$
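The general rule above, that each partial coefficient is built from partials of the next lower order, lends itself to a short recursive routine. The sketch below is illustrative only; the function name `partial_r` and the dictionary layout are assumptions, and the six zero-order coefficients are those reported for the grade data in Chap. XVII.

```python
import math


def partial_r(r0, i, j, held=()):
    """Partial correlation r_{ij.held} by successive elimination.

    r0 maps frozenset({a, b}) to the zero-order coefficient r_ab.
    Each call removes the last variable n in `held`, using the
    coefficients of the next lower order.
    """
    if not held:
        return r0[frozenset((i, j))]
    *rest, n = held
    rest = tuple(rest)
    r_ij = partial_r(r0, i, j, rest)
    r_in = partial_r(r0, i, n, rest)
    r_jn = partial_r(r0, j, n, rest)
    return (r_ij - r_in * r_jn) / math.sqrt((1 - r_in**2) * (1 - r_jn**2))


zero_order = {
    frozenset((1, 2)): 0.89576, frozenset((1, 3)): 0.62894,
    frozenset((2, 3)): 0.59373, frozenset((1, 4)): 0.49106,
    frozenset((2, 4)): 0.48807, frozenset((3, 4)): 0.31551,
}
r12_3 = partial_r(zero_order, 1, 2, (3,))      # first order
r12_34 = partial_r(zero_order, 1, 2, (3, 4))   # second order
print(round(r12_3, 5), round(r12_34, 4))
```

The order in which the held variables are eliminated should not affect the result, which gives a convenient internal check on hand calculations.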



CHAPTER XVII 


ANALYSIS OF A MULTIVARIATE FREQUENCY 
DISTRIBUTION ILLUSTRATED 

To illustrate the analysis of a multivariate frequency distribution, data on grades of 81 freshmen in a woman's college were selected. These are arranged in Table 47 in such a way as to facilitate the construction of simple linear-correlation tables and to facilitate detailed study of the multivariate distribution. The first part of this chapter will illustrate trivariate-frequency-distribution analysis. The $X_4$ variable is included in the table so that later in the chapter the analysis can be extended to four variables. Beyond four variables, the method proceeds in a symmetrical fashion.

Examination of Multivariate Distribution. The first step in the analysis of a multivariate distribution is to determine how well the assumption of linearity of relationship is approximated. The scatter of cases in the correlation table for $X_1$ and $X_2$ appears to indicate simple linear regression between $X_1$ and $X_2$.* In Tables 48 and 49, correlation tables for $X_1$ and $X_3$ and for $X_2$ and $X_3$, respectively, the scatter of cases suggests that these regressions might be only slightly if at all nonlinear. As pointed out in the preceding chapter, however, the simple correlation charts cannot be expected to reveal how much net regression exists in a trivariate correlation problem; accordingly, further study of the trivariate distribution of $X_1$, $X_2$, and $X_3$ should be undertaken. In order to do this the multivariate distribution itself, shown in Table 47, must be studied.

For example, the net regression of $X_2$ on $X_3$ may be tested in a preliminary fashion by holding constant the $X_1$ variable. Accordingly, analysis may be made of the $X_2$ and $X_3$ grades of the 16 students whose $X_1$ grades are 200. Table 50 shows the $X_2$ and $X_3$ grades of these 16 freshmen, selected from Table 47.

* See Table 34 (p. 358).
Table 47. — Grades of 81 Mount Holyoke Freshmen in Verbal Scholastic-aptitude Test, in College Board English Examination, and in First- and Second-semester English
(A, B, C, and D grades converted to a numerical scale)

Student   Second-semester   First-semester   Verbal    College Board
number    English grade     English grade    S.A.T.    English examination
          X_1               X_2              X_3       X_4

   1      240               220              573       407
   2      200               180              473       456
   3      260               240              680       615
   4      260               260              568       671
   5      160               160              437       546
   6      240               220              567       615
   7      220               200              671       449
   8       60               120              415       428
   9      220               240              434       726
  10      200               180              437       504
  11      220               220              443       601
  12      140               180              583       449
  13      160               120              384       553
  14      240               260              634       546
  15      260               240              477       671
  16      200               160              594       553
  17      200               160              495       449
  18      240               240              592       532
  19      240               220              598       574
  20      240               220              536       650
  21      160               140              408       532
  22      220               240              569       518
  23      200               200              443       477
  24      100               100              421       518
  25      160               140              356       366
  26      200               160              535       449
  27      180               180              415       671
  28      180               160              502       539
  29      240               240              477       532
  30      200               200              504       546
  31      200               200              431       518
  32      180               160              442       518
  33      260               220              509       518
  34      160               120              449       560
  35      240               240              416       539
  36      220               220              531       587
  37      220               240              518       622
  38      160               120              472       532
  39      200               200              619       643
  40      220               220              456       463
  41                        260              646       643
  42                        160              511       532
  43      100                60              339       442
  44      200               220              424       594
  45      200               200              515       594
  46      160               120              460       525
  47      180               160              418       504
  48      280               220              581       684
  49      200               200              479       560
  50      220               220              579       532
  51      220               200              600       594
  52      240               220              610       497
  53      100                60              376       435
  54      220               220              458       484
  55      240               200              600       497
  56      200               220              489       539
  57      220               220              567       525
  58      220               200              513       636
  59      240               200              644       581
  60      180               140              521       401
  61      160               140              345       442
  62      240               220              596       636
  63      260               260              667       567
  64      160               120              384       546
  65      260               240              646       511
  66      220               180              500       504
  67      220               240              613       608
  68      260               260              707       615
  69      240               220              703       657
  70      200               200              391       497
  71      140               120              416       442
  72      260               240              667       657
  73      200               180              477       664
  74      300               280              709       518
  75      180               140              527       539
  76      220               180              475       629
  77      180               180              636       594
  78      300               280              543       726
  79      220               220              432       546
  80      220               200              380       428
  81      200               180              479       539
[Table 48. Correlation table for $X_1$ (second-semester English grade) and $X_3$ (verbal scholastic-aptitude test).]

[Table 49. Correlation table for $X_2$ (first-semester English grade) and $X_3$ (verbal scholastic-aptitude test).]
Table 50. — $X_2$ and $X_3$ Grades of 16 Students Whose $X_1$ Grade Is 200

Student
number     X_2     X_3

   2       180     473
  10       180     437
  16       160     594
  17       160     495
  23       200     443
  26       160     535
  30       200     504
  31       200     431
  39       200     619
  44       220     424
  45       200     515
  49       200     479
  56       220     489
  70       200     391
  73       180     477
  81       180     479
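Holding one variable constant, as done in selecting Table 50 from Table 47, amounts to filtering the cases on a single value. The following sketch shows the idea on a handful of rows copied from Table 47; the helper name `hold_constant` and the tuple layout are assumptions made for illustration, and only a few students are listed rather than the full table.

```python
# A few rows of Table 47: (student number, X1, X2, X3).
rows = [
    (1, 240, 220, 573), (2, 200, 180, 473), (3, 260, 240, 680),
    (5, 160, 160, 437), (10, 200, 180, 437), (16, 200, 160, 594),
    (23, 200, 200, 443), (44, 200, 220, 424),
]


def hold_constant(cases, column, value):
    """Keep only the cases whose grade in `column` equals `value`."""
    return [case for case in cases if case[column] == value]


# Students whose second-semester grade X1 is 200, with their X2 and X3 grades.
x1_is_200 = hold_constant(rows, column=1, value=200)
pairs = [(x2, x3) for _, _, x2, x3 in x1_is_200]
print(pairs)
```

Repeating the selection for several other values of $X_1$ gives the additional scatter diagrams needed when nonlinearity is suspected.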


Figure 119 shows the scatter of these 16 bivariates; it indicates that the regression may be linear, although there appears to be little tendency for $X_2$ and $X_3$, unaffected by $X_1$, to be correlated with each other. Evidently violence is not done to the facts by assuming that any correlation present is linear in character. It must be remembered, of course, that this preliminary test of net regression between $X_2$ and $X_3$, holding $X_1$ constant, is based on a small sample of only 16 observations. Similar tests, holding $X_1$ constant at several other values, respectively, should be made, especially if nonlinearity is suspected.

Fig. 119. — Net regression of $X_2$ on $X_3$, holding $X_1$ constant. Test based on $X_1 = 200$.

Inasmuch as Table 34 (page 358) shows such a clear linear total regression between $X_1$ and $X_2$, it may be assumed that the net regression of $X_1$ on $X_2$, and vice versa, is linear. For further illustration of the method of examining the multivariate frequency distribution to test for linearity of regression, Table 51 presents the joint variation in $X_1$ and $X_3$ grades of those students whose $X_2$ grade is 220. This will make it possible to test the linearity of net regression between $X_1$ and $X_3$, holding $X_2$ constant.

Table 51. — $X_1$ and $X_3$ Grades of 18 Students Whose $X_2$ Grade Is 220

Student
number     X_1     X_3

   1       240     573
   6       240     567
  11       220     443
  19       240     598
  20       240     536
  33       260     509
  36       220     531
  40       220     456
  44       200     424
  48       280     581
  50       220     579
  52       240     610
  54       220     458
  56       200     489
  57       220     567
  62       240     596
  69       240     703
  79       220     432


The bivariate frequency distribution of the eighteen cases in Table 51 is plotted in Fig. 120, which shows that their scatter, with the exception of one case, follows a linear path. It may be concluded that the net regression between $X_1$ and $X_3$ approximates the linear form. Here again, especially if nonlinearity is suspected, similar tests using several different values of $X_2$, respectively, should be made. Tables 50 and 51 serve only to illustrate the method, which would ordinarily need to be applied more completely than is done here.

Fig. 120. — Net regression between $X_1$ and $X_3$, holding $X_2$ constant. Test based on $X_2 = 220$.


Statistics of the Trivariate Frequency Distribution. Study of the trivariate frequency distribution of $X_1$, $X_2$, and $X_3$, shown in Table 47, appears to indicate that regressions are linear, and therefore it may be assumed that the methods of correlation outlined in Chap. XVI may appropriately be applied. In order to calculate the statistics for a trivariate frequency distribution, it is necessary to obtain first the statistics for the various monovariate distributions.

Calculation of Zero-order Statistics. Correlation tables 48 and 49 may be used as work sheets, as illustrated in Chap. XIV, to calculate the zero-order statistics. Since the problem is now one of analyzing a trivariate frequency distribution, it is well to set up a schedule for calculation of the respective statistics. Table 52 is a schedule of the means and variances for the three monovariate frequency distributions taken separately. In each case the mean was calculated by using the formula





r 
Table 52. — Means and Variances

Means                 Sums of squares               Variances                    Standard deviations

$\bar{X}_1$ = 217.4   $N\sigma_1^2$ = 156,355.56    $\sigma_1^2$ = 1,930.3155    $\sigma_1$ = 43.94
$\bar{X}_2$ = 204.1   $N\sigma_2^2$ = 181,155.56    $\sigma_2^2$ = 2,236.4883    $\sigma_2$ = 47.29
$\bar{X}_3$ = 515.5   $N\sigma_3^2$ = 719,222.22    $\sigma_3^2$ = 8,879.2866    $\sigma_3$ = 94.23


The sum of squares of deviations from the mean in each case is 
calculated by using the formula 



The required sums are all found in the correlation tables, for example,

$$N\sigma_1^2 = 400\left[543 - \frac{(111)^2}{81}\right] = 400(543 - 152.11111) = 156{,}355.56$$

Table 53 is a schedule for the calculation of zero-order Pear- 
sonian coefficients of correlation, using the equation that was 
used in Chap. XIV. This equation is shown at the head of Table 
53. The entries all come from Tables 48 and 49 and Table 34 
(page 358). 

Table 53. — Calculation of Simple r's

            (1)        (2)          (3)          (4)           (5)
                                                 (1) − (2)     (4) ÷ (3)

$r_{12}$    455.00      78.11111    420.74964    376.88889     +0.89576
$r_{13}$    283.00     -68.51852    558.90486    351.51852     +0.62894
$r_{23}$    322.00     -35.18519    601.59915    357.18519     +0.59373
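The last two columns of Table 53 follow mechanically from the first three: column (4) is column (1) minus column (2), and column (5) is column (4) divided by column (3). A minimal sketch of that step, using the tabled entries, is given here; the function name `simple_r` and the dictionary layout are illustrative assumptions.

```python
def simple_r(col1, col2, col3):
    """Columns (1)-(3) of the work sheet -> (column (4), column (5))."""
    col4 = col1 - col2
    return col4, col4 / col3


worksheet = {
    "r12": (455.00, 78.11111, 420.74964),
    "r13": (283.00, -68.51852, 558.90486),
    "r23": (322.00, -35.18519, 601.59915),
}
simple_rs = {name: simple_r(*cols)[1] for name, cols in worksheet.items()}
for name, value in simple_rs.items():
    print(name, round(value, 5))
```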


Calculation of First-order Statistics. As suggested in Chap. XVI, the first-order statistics of a trivariate frequency distribution may be calculated by several methods. The most efficient method appears to be to calculate the first-order $\beta$'s and from the first-order $\beta$'s to calculate the other first-order statistics. Table 54 is a work sheet for the orderly calculation of the first-order $\beta$'s in the illustrated trivariate frequency distribution.

Table 54. — Calculation of the First-order $\beta$'s from the Zero-order $r$'s¹

$$\beta_{ij.k} = \frac{r_{ij} - r_{ik}\,r_{jk}}{1 - r_{jk}^2}$$   [See Chap. XVI, Eqs. (9)]

        (1)                    (2)           (3)          (4)          (5)
  Zero-order r (β)          Product       Whole        1 − r²       First-order β
  Subscript  Regression     term of       numerator                 Subscript  Regression
             statistic      numerator                                          statistic

  12         0.89576        0.37342       0.52234      0.64748      12.3       0.80673
  13         0.62894
  23         0.59373

  13         0.62894        0.53184       0.09710      0.64748      13.2       0.14997
  12         0.89576
  23         0.59373

  12         0.89576        0.37342       0.52234      0.60444      21.3       0.86417
  23         0.59373
  13         0.62894

  23         0.59373        0.56338       0.03035      0.60444      23.1       0.05021
  12         0.89576
  13         0.62894

  13         0.62894        0.53184       0.09710      0.19762      31.2       0.49135
  23         0.59373
  12         0.89576

  23         0.59373        0.56338       0.03035      0.19762      32.1       0.15358
  13         0.62894
  12         0.89576

¹ Note the internal checks in columns (2), (3), and (4), in which each of three values occurs twice; in column (2), the first and third, second and fifth, fourth and sixth figures check; in column (3), the same orders check; in column (4), the first and second, the third and fourth, and the fifth and sixth figures check. While not independent checks, they nevertheless give confidence in the accuracy of the work as it proceeds.

If preferred, the $b$'s instead of the $\beta$'s could first be calculated, by using a similar table and the general formula

$$b_{ij.k} = \frac{b_{ij} - b_{ik}\,b_{kj}}{1 - b_{jk}\,b_{kj}}$$

The entries in column (1) of the table are obtained from Table 53. Bearing in mind the symmetry in the formula shown at the head of Table 54, the zero-order $r$'s, which are also the zero-order $\beta$'s, are copied in the order in which they occur in the formula. Consequently, the entries in column (2) are the products of the $r$'s in the second and third lines of each trio of $r$'s in column (1). The entry in column (3) is the first $r$ of each trio minus the entry in column (2). The $1 - r^2$ in column (4), which may be found by using a sine table, is for the third $r$ in each trio of $r$'s in column (1). Thus, if the trios of $r$'s are properly arranged in column (1), which can be done by following the general formula at the head of the table, the symmetry of the work sheet facilitates all necessary calculations. In using this work sheet, the first step is to write in column (5) the subscript for the first-order $\beta$ that is to be calculated; this subscript then determines the order of the zero-order $r$'s in column (1). The value of the first-order $\beta$, entered in column (5), is found by dividing the entry in column (3) by the corresponding entry in column (4).
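The work-sheet routine of Table 54 can be retraced with a single function applied to each trio of zero-order $r$'s. This is a sketch under the formula at the head of the table, not a replacement for the work sheet; the function name `beta_first_order` and the dictionary keys are hypothetical labels.

```python
def beta_first_order(r_ij, r_ik, r_jk):
    """beta_ij.k = (r_ij - r_ik * r_jk) / (1 - r_jk**2)."""
    return (r_ij - r_ik * r_jk) / (1 - r_jk**2)


r12, r13, r23 = 0.89576, 0.62894, 0.59373

# Each call lists the trio in the order it occupies in column (1).
betas = {
    "12.3": beta_first_order(r12, r13, r23),
    "13.2": beta_first_order(r13, r12, r23),
    "21.3": beta_first_order(r12, r23, r13),
    "23.1": beta_first_order(r23, r12, r13),
    "31.2": beta_first_order(r13, r23, r12),
    "32.1": beta_first_order(r23, r13, r12),
}
for subscript, value in betas.items():
    print(subscript, round(value, 5))
```

Because the machine carries more decimals than the five-place work sheet, the last figure may differ by a unit from the tabled values.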

The coefficients of partial correlation are readily calculated from the $\beta$'s as follows:¹

$$r_{12.3}^2 = \beta_{12.3}\,\beta_{21.3} = 0.80673(0.86417) = 0.69715 \qquad r_{12.3} = 0.83496$$

$$r_{13.2}^2 = \beta_{13.2}\,\beta_{31.2} = 0.14997(0.49135) = 0.07369 \qquad r_{13.2} = 0.27146$$

$$r_{23.1}^2 = \beta_{23.1}\,\beta_{32.1} = 0.05021(0.15358) = 0.007711 \qquad r_{23.1} = 0.08782$$

¹ The coefficients of partial correlation could be checked by using any one of several formulas, as follows:

$$r_{ij.k} = \frac{r_{ij} - r_{ik}\,r_{jk}}{\sqrt{1 - r_{ik}^2}\,\sqrt{1 - r_{jk}^2}}$$

$$r_{ij.k} = \beta_{ij.k}\,\frac{\sqrt{1 - r_{jk}^2}}{\sqrt{1 - r_{ik}^2}}$$

$$r_{ij.k} = b_{ij.k}\,\frac{\sigma_{j.k}}{\sigma_{i.k}}$$

These formulas all have the advantage that they determine the positive or negative sign of the partial $r$; but the partial $r$ always has the same sign as its corresponding $\beta$. Cf. also p. 460.
Thus is determined the arithmetical value of the first-order coefficients of partial correlation. Each coefficient of partial correlation is positive if the $\beta$'s from which it is derived are both positive, negative if the $\beta$'s are negative. The respective pairs of $\beta$'s involved are never of opposite sign.
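The rule just stated, a square root of the product of the paired $\beta$'s with the sign taken from the $\beta$'s, may be sketched as follows. The function name `partial_r_from_betas` is hypothetical, and the error raised for a pair of opposite sign is an added safeguard rather than part of the text.

```python
import math


def partial_r_from_betas(beta_ij, beta_ji):
    """r_ij.k as the geometric mean of the paired betas, carrying their sign."""
    product = beta_ij * beta_ji
    if product < 0:
        raise ValueError("the paired betas must not differ in sign")
    return math.copysign(math.sqrt(product), beta_ij)


r12_3 = partial_r_from_betas(0.80673, 0.86417)
r13_2 = partial_r_from_betas(0.14997, 0.49135)
r23_1 = partial_r_from_betas(0.05021, 0.15358)
print(round(r12_3, 5), round(r13_2, 5), round(r23_1, 5))
```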

The $b$ statistics are calculated from the $\beta$'s as follows:

$$b_{ij.k} = \beta_{ij.k}\,\frac{\sigma_i}{\sigma_j}$$

$$b_{12.3} = 0.80673\,\frac{43.9354}{47.2915} = 0.80673(0.92903) = 0.74948$$

$$b_{13.2} = 0.14997\,\frac{43.9354}{94.2300} = 0.14997(0.46626) = 0.06992$$

$$b_{21.3} = 0.86417\,\frac{47.2915}{43.9354} = 0.86417(1.07639) = 0.93020$$

$$b_{23.1} = 0.05021\,\frac{47.2915}{94.2300} = 0.05021(0.50187) = 0.02520$$

$$b_{31.2} = 0.49135\,\frac{94.2300}{43.9354} = 0.49135(2.1447) = 1.05380$$

$$b_{32.1} = 0.15358\,\frac{94.2300}{47.2915} = 0.15358(1.9925) = 0.30601$$

The first-order $a$ statistics are calculated as follows:

$$a_{i.jk} = \bar{X}_i - b_{ij.k}\bar{X}_j - b_{ik.j}\bar{X}_k$$

$$a_{1.23} = 217.4074 - 0.74948(204.074) - 0.06992(515.4816) = 28.4166$$

$$a_{2.13} = 204.074 - 0.93020(217.4074) - 0.02520(515.4816) = -11.148$$

$$a_{3.12} = 515.4816 - 1.05380(217.4074) - 0.30601(204.074) = 223.929$$

The equations of the three planes of regression are, therefore, as follows:

$$X_1' = 28.42 + 0.75X_2 + 0.07X_3$$

$$X_2' = -11.15 + 0.93X_1 + 0.025X_3$$

$$X_3' = 223.93 + 1.05X_1 + 0.31X_2$$
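The passage from the $\beta$'s to the $b$'s and then to the constant $a$ of a regression plane can be followed in a short sketch. The helper names `b_from_beta` and `intercept` are hypothetical; the means, standard deviations, and $\beta$'s are those given for the grade data.

```python
def b_from_beta(beta_ij_k, sigma_i, sigma_j):
    """b_ij.k = beta_ij.k * sigma_i / sigma_j."""
    return beta_ij_k * sigma_i / sigma_j


def intercept(mean_i, slopes_and_means):
    """a = mean of the dependent variable minus each b times its variable's mean."""
    return mean_i - sum(b * mean for b, mean in slopes_and_means)


s1, s2, s3 = 43.9354, 47.2915, 94.2300
m1, m2, m3 = 217.4074, 204.074, 515.4816

b12_3 = b_from_beta(0.80673, s1, s2)
b13_2 = b_from_beta(0.14997, s1, s3)
a1_23 = intercept(m1, [(b12_3, m2), (b13_2, m3)])
print(round(b12_3, 5), round(b13_2, 5), round(a1_23, 2))
```

The constant obtained this way can differ slightly in the third decimal from the hand calculation, which used the $b$'s rounded to five places.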
If $X_1$ is considered the dependent variable, it can be estimated from the first equation; if $X_2$ is considered the dependent variable, it can be estimated from the second equation; if $X_3$ is considered the dependent variable, it can be estimated from the third equation. The second-order standard deviations, respectively, about the three planes of regression may also be calculated.¹

$$\sigma_{i.jk}^2 = \sigma_i^2\,(1 - r_{ij}^2)(1 - r_{ik.j}^2)$$

$$\sigma_{1.23}^2 = \sigma_1^2\,(1 - r_{12}^2)(1 - r_{13.2}^2) = 1{,}930.3155(0.19762)(0.92631) = 353.3585 \qquad \sigma_{1.23} = 18.7975$$

$$\sigma_{2.13}^2 = \sigma_2^2\,(1 - r_{12}^2)(1 - r_{23.1}^2) = 2{,}236.4883(0.19762)(0.99229) = 438.5672 \qquad \sigma_{2.13} = 20.9416$$

$$\sigma_{3.12}^2 = \sigma_3^2\,(1 - r_{13}^2)(1 - r_{23.1}^2) = 8{,}879.2866(0.60444)(0.99229) = 5{,}325.616 \qquad \sigma_{3.12} = 72.9767$$

The multiple-correlation coefficients, which also measure the goodness of fit of the planes of regression, may now be calculated as follows:²

$$R_{i.jk}^2 = 1 - \frac{\sigma_{i.jk}^2}{\sigma_i^2}$$

$$R_{1.23}^2 = 1 - \frac{353.3585}{1{,}930.3155} \qquad R_{1.23} = 0.9039$$

$$R_{2.13}^2 = 1 - \frac{438.5672}{2{,}236.4883} \qquad R_{2.13} = 0.8966$$

$$R_{3.12}^2 = 1 - \frac{5{,}325.616}{8{,}879.2866} = 1 - 0.5998 = 0.4002 \qquad R_{3.12} = 0.6326$$

¹ They could also be calculated by using Eq. (19) (p. 416). Thus the calculation of $\sigma_{1.23}^2$ could be checked not only by using

$$\sigma_{1.23}^2 = \sigma_1^2\,(1 - r_{12}^2)(1 - r_{13.2}^2)$$

but also by using the following formula:

$$\sigma_{1.23}^2 = \sigma_1^2\,(1 - \beta_{12.3}\,r_{12} - \beta_{13.2}\,r_{13})$$

² The calculation of $R$ may be checked by using Eq. (20), p. 417.
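The scatter about the plane for $X_1$ and the corresponding multiple-correlation coefficient can be recomputed from the variance and the two correlation coefficients involved. The sketch below is illustrative; the function name `residual_variance` is an assumption, and the inputs are the values already obtained for the grade data.

```python
import math


def residual_variance(var_i, r_ij, r_ik_j):
    """sigma^2_i.jk = sigma^2_i * (1 - r_ij^2) * (1 - r_ik.j^2)."""
    return var_i * (1 - r_ij**2) * (1 - r_ik_j**2)


var1 = 1930.3155
resid_var = residual_variance(var1, r_ij=0.89576, r_ik_j=0.27146)
sigma_1_23 = math.sqrt(resid_var)
R_1_23 = math.sqrt(1 - resid_var / var1)
print(round(sigma_1_23, 4), round(R_1_23, 4))
```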




An all-round check on the various calculations may be obtained by using Eq. (31), Chap. XVI, as follows:

Table 55

$$1 = \beta_{ij.k}^2 + \beta_{ik.j}^2 + 2r_{jk}\,\beta_{ij.k}\,\beta_{ik.j} + \frac{\sigma_{i.jk}^2}{\sigma_i^2}$$

$$1 = \beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\,\beta_{12.3}\,\beta_{13.2} + \frac{\sigma_{1.23}^2}{\sigma_1^2}$$

         (0.80673)²   (0.14997)²   2(0.59373)(0.80673)(0.14997)
1.0000 = 0.65081    + 0.02249    + 0.14367                      + 0.18304

$$1 = \beta_{23.1}^2 + \beta_{21.3}^2 + 2r_{13}\,\beta_{23.1}\,\beta_{21.3} + \frac{\sigma_{2.13}^2}{\sigma_2^2}$$

         (0.05021)²   (0.86417)²   2(0.62894)(0.05021)(0.86417)
1.0000 = 0.00252    + 0.74679    + 0.05458                      + 0.19608

$$1 = \beta_{32.1}^2 + \beta_{31.2}^2 + 2r_{12}\,\beta_{32.1}\,\beta_{31.2} + \frac{\sigma_{3.12}^2}{\sigma_3^2}$$

         (0.15358)²   (0.49135)²   2(0.89576)(0.15358)(0.49135)
1.0000 = 0.02359    + 0.24142    + 0.13519                      + 0.59978
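The check in Table 55 can be repeated for all three planes at once. The sketch below simply adds the four terms of each line; the function name `check_terms` is hypothetical, and the residual ratios in the last position are taken as printed in the table rather than recomputed.

```python
def check_terms(beta_ij_k, beta_ik_j, r_jk, resid_ratio):
    """The four terms of the check: two squared betas, the cross product, the residual."""
    return (beta_ij_k**2, beta_ik_j**2,
            2 * r_jk * beta_ij_k * beta_ik_j, resid_ratio)


planes = {
    "X1": check_terms(0.80673, 0.14997, 0.59373, 0.18304),
    "X2": check_terms(0.05021, 0.86417, 0.62894, 0.19608),
    "X3": check_terms(0.15358, 0.49135, 0.89576, 0.59978),
}
totals = {name: sum(terms) for name, terms in planes.items()}
for name, total in totals.items():
    print(name, round(total, 4))
```

Each total should agree with 1 to within the rounding of the five-place entries.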

Interpretation of Results Illustrated. The interpretation of the above statistics of a trivariate frequency distribution may be illustrated by assuming that it is desired to predict $X_1$, the second-semester grades of freshmen at the woman's college selected. From the equation for the plane of regression of $X_1$ on $X_2$ and $X_3$, namely, $X_1' = 28.42 + 0.75X_2 + 0.07X_3$, estimates may be made of a freshman's grade in second-semester English if her grades in the verbal scholastic-aptitude test and in first-semester English are known.

Estimates Based on Regression Equation. If a freshman's grade in first-semester English were 300 and her grade in the verbal scholastic-aptitude test were 600, her second-term English grade would be estimated at

$$X_1' = 28.42 + 0.75(300) + 0.07(600) = 28.42 + 225 + 42 = 295$$
Since the second-term English grade will, of course, be affected by other factors, the student's actual grade in second-semester English will deviate from estimates based upon the regression equation. This raises the question as to how much, on the average, it can be expected that estimates based on the regression equation will deviate from the actual values. The answer is found by the determination of the value of $\sigma_{1.23}$, which has been found above to be 18.8, or approximately 19. The standard deviation of the differences between the actual grades and estimates based on the above regression equation is therefore about 19. If this regression equation and first-order standard deviation are typical of these college grades and if the differences between actual and estimated values are in general normally distributed, the chances are about 95 in 100 that the actual value in any particular case will fall within limits $\pm 38$ ($= 2\sigma_{1.23}$) from the estimated value.
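The estimate and its limits may be put together in a brief sketch. The function names `estimate_x1` and `two_sigma_limits` are hypothetical; the coefficients are those of the regression plane for $X_1$, and the scatter is rounded to 19 as in the discussion above. The limits are meaningful only under the stated assumption that the deviations are approximately normally distributed.

```python
def estimate_x1(x2, x3):
    """Second-semester English grade estimated from the plane of regression."""
    return 28.42 + 0.75 * x2 + 0.07 * x3


def two_sigma_limits(estimate, sigma=19):
    """Limits two standard deviations on either side of the estimate."""
    return estimate - 2 * sigma, estimate + 2 * sigma


est = estimate_x1(x2=300, x3=600)
low, high = two_sigma_limits(est)
print(round(est, 2), round(low, 2), round(high, 2))
```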

The foregoing conclusion, which is based on the value of $\sigma_{1.23}$, can be summarized very succinctly by the calculation of $R_{1.23}$, which has been found to be equal to 0.9039. This is a fairly high coefficient of multiple correlation. It shows that the above plane of regression is a good fit, and therefore estimates based upon it can be expected to be fairly good.

Partial-correlation Coefficients. Since both $b$ statistics in the equation of regression are positive, it is known that the net correlations between $X_1$ and $X_2$ and between $X_1$ and $X_3$ are positive. The amount of the net correlation is given by the coefficients of partial correlation $r_{12.3} = 0.83496$ and $r_{13.2} = 0.27146$. These show that second-semester English grades are much more closely related to first-semester English grades than they are to verbal scholastic-aptitude test grades.

Analysis of Variance in $X_1$. From the $\beta^2$'s and the cross products, analysis of the variance in second-semester English grades can be made. Thus, from the first set of $\beta^2$'s and cross products in Table 55, it is seen that 65.1 per cent of the variance in second-semester English grades, $X_1$, is accounted for by direct association with first-semester English grades. Only 2.2 per cent is accounted for by direct association with verbal scholastic-aptitude test grades, although 14.4 per cent of the variance in second-semester English is accounted for by indirect association with both first-semester grades and verbal scholastic-aptitude test grades. The variation in other influences accounts for 18.3 per cent of the variance in second-semester English grades.

Under conditions existing at the woman’s college studied, it 
appears to be an inevitable conclusion that knowledge of grades 
in verbal scholastic-aptitude tests is not so helpful as might be 
supposed in predicting the subsequent performance of college 
freshmen students. 

Extension of Analysis to Include Four Variables. Additional Zero-order Statistics. The extension of the trivariate frequency distribution to include a fourth variable $X_4$ requires first the calculation of the mean and standard deviation of the added variable. It requires also the calculation of the simple correlation coefficients between the new variable and each of the other three. For illustration, the fourth variable taken is the grade in the College Board English examination. Tables 56 to 58 are the usual work sheets for a correlation problem. From them the necessary data are obtained for calculating the additional zero-order statistics, as follows:

$\bar{X}_4$ = 549.8889               $r_{14}$ = 0.49106
$N\sigma_4^2$ = 501,201.9828         $r_{24}$ = 0.48807
$\sigma_4^2$ = 6,187.6788            $r_{34}$ = 0.31551
$\sigma_4$ = 78.6618

Additional First-order Statistics. Among four variables it is possible to distinguish four different sets of trivariate frequency distributions, each of which will have three planes of regression. Accordingly, when four variables are involved the total number of first-order $\beta$ statistics is 24, two for each plane of regression. Six of these twenty-four were calculated in Table 54; the remaining 18 may be obtained by a similar procedure. Table 59 shows the 24 $\beta$'s for the illustrated four-variable problem, grouped according to the four possible trivariate frequency distributions.

Each of the four trivariate frequency distributions could be analyzed as illustrated in the preceding sections of this chapter. From the first-order $\beta$'s shown in Table 59 all the other first-order statistics may be obtained, by methods already explained.

In few problems is it necessary or even desirable to calculate all 24 first-order statistics of the four trivariate frequency distributions involved in a four-variable set. As may be seen from an examination of Table 60, it is possible to calculate all the second-order $\beta$ statistics if only 18 of the 24 first-order $\beta$ statistics are known. If one only of the four planes of regression in the four-variable correlation problem is significant or important, it is necessary to calculate only 8 of the first-order $\beta$ statistics.

Table 59. — The First-order $\beta$'s in the Four Trivariate Frequency Distributions for Four Variables
Data on four kinds of grades of 81 college freshmen at the selected woman's college

First plane                    Second plane                   Third plane

Trivariate Distribution $X_1$, $X_2$, $X_3$
$\beta_{12.3}$ = 0.80673       $\beta_{21.3}$ = 0.86417       $\beta_{31.2}$ = 0.49135
$\beta_{13.2}$ = 0.14997       $\beta_{23.1}$ = 0.05021       $\beta_{32.1}$ = 0.15358

Trivariate Distribution $X_1$, $X_2$, $X_4$
$\beta_{12.4}$ = 0.86124       $\beta_{21.4}$ = 0.86457       $\beta_{41.2}$ = 0.27260
$\beta_{14.2}$ = 0.07071       $\beta_{24.1}$ = 0.06352       $\beta_{42.1}$ = 0.24392

Trivariate Distribution $X_1$, $X_3$, $X_4$
$\beta_{13.4}$ = 0.52642       $\beta_{31.4}$ = 0.62463       $\beta_{41.3}$ = 0.48413
$\beta_{14.3}$ = 0.32498       $\beta_{34.1}$ = 0.00877       $\beta_{43.1}$ = 0.01101

Trivariate Distribution $X_2$, $X_3$, $X_4$
$\beta_{23.4}$ = 0.48837       $\beta_{32.4}$ = 0.57724       $\beta_{42.3}$ = 0.46447
$\beta_{24.3}$ = 0.33400       $\beta_{34.2}$ = 0.03378       $\beta_{43.2}$ = 0.03974

Second-order Statistics in a Four-variable Problem. In the four-variable correlation problem, statistics for four planes of regression may be obtained. Following are the four possible regression equations:

$$X_1' = a_{1.234} + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4$$

$$X_2' = a_{2.134} + b_{21.34}X_1 + b_{23.14}X_3 + b_{24.13}X_4$$

$$X_3' = a_{3.124} + b_{31.24}X_1 + b_{32.14}X_2 + b_{34.12}X_4$$

$$X_4' = a_{4.123} + b_{41.23}X_1 + b_{42.13}X_2 + b_{43.12}X_3$$

Also, for each plane of regression a scatter and a coefficient of multiple correlation may be calculated. The procedure is similar to that already illustrated; that is to say, the second-order $\beta$'s are first obtained, and from them all the other second-order statistics are calculated. Table 60 illustrates the procedure for making the necessary calculations to obtain the 12 possible second-order $\beta$ statistics.

Calculation of Second-order Statistics. In a problem where the first-order partial coefficients of correlation are already calculated, it is advisable to modify the formula for finding second-order $\beta$ statistics from first-order $\beta$ statistics as follows:

According to Eq. (39), Chap. XVI, it was found that

$$\beta_{ij.kn} = \frac{\beta_{ij.k} - \beta_{in.k}\,\beta_{nj.k}}{1 - \beta_{nj.k}\,\beta_{jn.k}}$$

But from Eq. (24), Chap. XVI, it is known that

$$r_{jn.k}^2 = \beta_{nj.k}\,\beta_{jn.k}$$

Accordingly, the formula for finding the second-order $\beta$ statistics can be modified as follows:

$$\beta_{ij.kn} = \frac{\beta_{ij.k} - \beta_{in.k}\,\beta_{nj.k}}{1 - r_{jn.k}^2}$$

In order to secure the greatest convenience in calculation, the arrangement of the items in the work sheet (Table 60) is according to the terms of this formula. First the desired subscript for the $\beta$ statistic to be calculated is entered in column (5); then, following the formula, the order in which the required trio of first-order $\beta$'s appear in column (1) is determined. If this order is followed, the entry in column (2) is the product of the second two $\beta$'s of the trio in column (1); the entry in column (3) is found by subtracting the entry in column (2) from the first $\beta$ of the trio in column (1); the subscript of the third of the trio in column (1) is the subscript of the partial $r$ for which $1 - r^2$ is to be found in appropriate tables or, if preferred, calculated. The desired second-order $\beta$'s are then calculated, by dividing the entry in column (3) by the entry in column (4), and entered in column (5).
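The step from first-order to second-order $\beta$'s can be sketched directly from Eq. (39) together with Eq. (24). The function name `beta_second_order` is hypothetical; the denominator is formed from the product of the paired $\beta$'s, which equals $r_{jn.k}^2$, so no separate table of partial $r$'s is needed. The inputs are first-order $\beta$'s from Table 59.

```python
def beta_second_order(beta_ij_k, beta_in_k, beta_nj_k, beta_jn_k):
    """beta_ij.kn = (beta_ij.k - beta_in.k * beta_nj.k) / (1 - r^2_jn.k)."""
    r2_jn_k = beta_nj_k * beta_jn_k  # Eq. (24): squared partial r
    return (beta_ij_k - beta_in_k * beta_nj_k) / (1 - r2_jn_k)


# beta_12.34 from beta_12.3, beta_14.3, beta_42.3, beta_24.3
beta_12_34 = beta_second_order(0.80673, 0.32498, 0.46447, 0.33400)
# beta_13.24 from beta_13.2, beta_14.2, beta_43.2, beta_34.2
beta_13_24 = beta_second_order(0.14997, 0.07071, 0.03974, 0.03378)
print(round(beta_12_34, 5), round(beta_13_24, 5))
```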

In problems for which it is not desired to calculate the first-order coefficients of partial correlation, the alternative method illustrated in Table 61 may be used. It is to be noted that the only differences are that an additional $\beta$ must be entered in column (1) in each of the sets and that an additional column,
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TilBLB 60. CAIiCTJLATION OF THE SeCOND-ORDER jS’s FROM THE FlRST- 

ORDER /3’S o 


in which 




fia.h /Stn.JfejSn/.fc 


1 - 


^in.h 


[See Chap. XVI, Eqs. (24 and 39)j 


m 

— m — 

(55 



First-order 0 

Product 

1 term of 
numerator 

Whole 

numerator 

1 — r*/n.Jb 

Second-order fi 

Subscript 

Regression 

statistic 

Subscript 

Regression 

statistic 

12.3 

0.80673 

0.15094 

0.65679 

0.84487 

12.34 

0.77620 

14.3 

0.32498 






42.3 

0.46447 






13.2 

0.14997 

0.00281 

0.14716 

0.99866 

13.24 

0.14736 

14.2 

0.07071 






43.2 

0.03974 






14.2 

0.07071 

0.00507 

0.06564 

0.99866 

14.23 

0.06573 

13.2 

0.14997 






34.2 

0.03378 






21.3 

0.86417 

0.16170 

0.70247 

0.84267 

21.34 

0.83362 

24.3 

0.33400 






41.3 

0.48413 






23.1 

0.05021 

0.00070 

0.04951 

0.99990 

23.14 

0.04951 

24.1 

0.06352 






43.1 

0.01101 






24.1 

0.06352 

0.00044 

0.06308 

0.99990 

24.13 

0.06309 

23.1 

0.05021 






34.1 

0.00877 






31.2 

0.49135 

0.00921 

0.48214 

0.98072 

31.24 

0.49162 

34.2 

0.03378 






41.2 

0.27260 






32.1 

0.15358 

0.00214 

0.15144 

0.98451 

32.14 

0.15382 

34.1 

0.00877 






42.1 

0.24392 






34.2 

0.03378 

0.03474 

-0.00096 

0.98072 

34.21 

-0.00098 

31.2 

0.49135 






14.2 

0.07071 






41.2 

0.27260 

0*01963 

0.26307 

0.92631 

41.23 

0.27320 

43.2 

0.03974 






31.2 

0.49135 






42.1 

0.24392 

0.00169 

0.24223 

0.99229 

42.13 

0.24411 

43.1 

0.01101 






32.1 

0.15358 






43.1 

0.01101 

0.01225 

-0,00124 

0.99229 

43.12 

-0.00125 

42.1 

0.24392 






23.1 

0.05021 





Table 61. Calculation of the Second-order β's from the First-order β's
(Alternative method illustrated)

$$\beta_{ij.kn} = \frac{\beta_{ij.k} - \beta_{in.k}\,\beta_{nj.k}}{1 - \beta_{nj.k}\,\beta_{jn.k}}$$

         (1)                (2)           (3)          (4)           (5)              (6)
    First-order β       Product term     Whole     Product term     Whole        Second-order β
Subscript  Regression   of numerator   numerator   of denominator  denominator  Subscript  Regression
           statistic                                                                       statistic

  12.3      0.80673       0.15094       0.65579      0.155133       0.84487      12.34     0.77620
  14.3      0.32498
  42.3      0.46447
  24.3      0.33400

  13.2      0.14997       0.00281       0.14716      0.001342       0.99866      13.24     0.14736
  14.2      0.07071
  43.2      0.03974
  34.2      0.03378

  14.2      0.07071       0.00507       0.06564      0.001342       0.99866      14.23     0.06573
  13.2      0.14997
  34.2      0.03378
  43.2      0.03974

If this method is used, the b's instead of the β's could be first calculated, using a similar
table and the general formula

$$b_{ij.kn} = \frac{b_{ij.k} - b_{in.k}\,b_{nj.k}}{1 - b_{nj.k}\,b_{jn.k}}$$

column (4), is required in which to enter the product term of the
denominator. The item in column (5) is then obtained by
taking the complement of the corresponding entry in column (4).
The second-order β is found by dividing the entry in column (3)
by the entry in column (5). For convenience of arrangement,
the product term of the numerator is written in the order
$\beta_{in.k}\beta_{nj.k}$ rather than $\beta_{nj.k}\beta_{in.k}$, and the product term of the
denominator is arranged in the order $\beta_{nj.k}\beta_{jn.k}$ rather than
$\beta_{jn.k}\beta_{nj.k}$. Except for the convenience in arrangement of the
work sheet, the order in which such product terms occur is
immaterial; but, when arranged as indicated, once the subscript
of the desired second-order β is entered in column (6), the order
in which the first-order β's occur in the equation may be followed
in entering them in column (1). There are only four first-order
β's in each set, for the third (in the numerator) is repeated in the
first part of the product term of the denominator. When this
procedure as to arrangement in the work sheet is followed, the
entry in column (2) is always the product of the two middle β's
in the set of four in column (1), and the entry in column (4) is
always the product of the last two β's entered in column (1).

The second-order coefficients of partial correlation are calculated
from the second-order β's as follows:

$$r^2_{ij.kn} = \beta_{ij.kn}\,\beta_{ji.kn}$$

or, for the four-variable case,

$$\begin{aligned}
r^2_{12.34} &= 0.77620(0.83362) = 0.647056, & r_{12.34} &= 0.80440 \\
r^2_{13.24} &= 0.14736(0.49162) = 0.072445, & r_{13.24} &= 0.26916 \\
r^2_{14.23} &= 0.06573(0.27320) = 0.017957, & r_{14.23} &= 0.13400 \\
r^2_{24.13} &= 0.06309(0.24411) = 0.015401, & r_{24.13} &= 0.12410 \\
r^2_{23.14} &= 0.04951(0.15382) = 0.007616, & r_{23.14} &= 0.08728 \\
r^2_{34.12} &= -0.00098(-0.00125) = 0.000001225, & r_{34.12} &= -0.00111
\end{aligned}$$

(The negative sign of the partial r is determined by the negative
sign of the corresponding β statistic.)
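As a check on the figures above, the rule (square root of the product of the two β's, with the sign of the β) can be sketched as follows. The helper name `partial_r` is hypothetical; only the standard `math` module is used.

```python
import math


def partial_r(beta_ij, beta_ji):
    """Partial r from a pair of betas: sqrt of their product, signed like the beta."""
    r_squared = beta_ij * beta_ji          # the two betas share a sign, so this is >= 0
    return math.copysign(math.sqrt(r_squared), beta_ij)


r_12_34 = partial_r(0.77620, 0.83362)      # about 0.8044
r_34_12 = partial_r(-0.00098, -0.00125)    # negative, about -0.0011
print(round(r_12_34, 5), round(r_34_12, 5))
```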

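The b statistics of the second order are calculated from the
second-order β's in the same way as the first-order b's from the
first-order β's, by the formula

$$b_{ij.kn} = \beta_{ij.kn}\,\frac{\sigma_i}{\sigma_j}$$

or, for the four-variable problem,

$$\begin{aligned}
b_{12.34} &= 0.77620\,(\sigma_1/\sigma_2) = 0.77620(0.92903) = 0.72111 \\
b_{13.24} &= 0.14736\,(\sigma_1/\sigma_3) = 0.14736(0.46626) = 0.06871 \\
b_{14.23} &= 0.06573\,(\sigma_1/\sigma_4) = 0.06573(0.55864) = 0.03671 \\
b_{21.34} &= 0.83362\,(\sigma_2/\sigma_1) = 0.83362(1.07639) = 0.89730 \\
b_{23.14} &= 0.04951\,(\sigma_2/\sigma_3) = 0.04951(0.50187) = 0.02485 \\
b_{24.13} &= 0.06309\,(\sigma_2/\sigma_4) = 0.06309(0.60120) = 0.03793 \\
b_{34.12} &= -0.00098\,(\sigma_3/\sigma_4) = -0.00098(1.19791) = -0.00117 \\
b_{31.24} &= 0.49162\,(\sigma_3/\sigma_1) = 0.49162(2.1447) = 1.05438 \\
b_{32.14} &= 0.15382\,(\sigma_3/\sigma_2) = 0.15382(1.9925) = 0.30650 \\
b_{41.23} &= 0.27320\,(\sigma_4/\sigma_1) = 0.27320(1.79040) = 0.48914 \\
b_{42.13} &= 0.24411\,(\sigma_4/\sigma_2) = 0.24411(1.66334) = 0.40604 \\
b_{43.12} &= -0.00125\,(\sigma_4/\sigma_3) = -0.00125(0.83478) = -0.00104
\end{aligned}$$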

It will be noted that, with the exception of those involving $\sigma_4$,
the standard-deviation ratios used in the above calculations
have all been computed and may be copied from the preceding
section, where the first-order b's were calculated from the
first-order β's.

The second-order a statistics are calculated as follows:

$$a_{i.jkn} = \bar{X}_i - b_{ij.kn}\bar{X}_j - b_{ik.jn}\bar{X}_k - b_{in.jk}\bar{X}_n$$

$$\begin{aligned}
a_{1.234} &= 217.4074 - 0.72111(204.074) - 0.06871(515.4816) - 0.03671(549.8889) = 14.64244 \\
a_{2.134} &= 204.074 - 0.89730(217.4074) - 0.02485(515.4816) - 0.03793(549.8889) = -24.67266 \\
a_{3.124} &= 515.4816 - 1.05438(217.4074) - 0.30650(204.074) + 0.00117(549.8889) = 224.34628 \\
a_{4.123} &= 549.8889 - 0.40604(204.074) + 0.00104(515.4816) - 0.48914(217.4074) = 361.22013
\end{aligned}$$
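The constant term of each plane is simply the mean of the dependent variable less the slope-weighted means of the others. A minimal sketch, in which the function name `intercept` is hypothetical and the figures are the means and second-order b's used in the text:

```python
def intercept(mean_dependent, slopes, means_independent):
    """a = mean of dependent variable minus sum of (b * mean of each independent)."""
    return mean_dependent - sum(b * m for b, m in zip(slopes, means_independent))


# a(1.234): mean of X1, then b12.34, b13.24, b14.23 against the means of X2, X3, X4
a_1_234 = intercept(217.4074,
                    [0.72111, 0.06871, 0.03671],
                    [204.074, 515.4816, 549.8889])
print(round(a_1_234, 3))
```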

The equations for the four planes of regression may now be 
written as follows: 

$$\begin{aligned}
X'_1 &= 14.64 + 0.721X_2 + 0.069X_3 + 0.037X_4 \\
X'_2 &= -24.67 + 0.897X_1 + 0.025X_3 + 0.038X_4 \\
X'_3 &= 224.35 + 1.05X_1 + 0.306X_2 - 0.0012X_4 \\
X'_4 &= 361.22 + 0.489X_1 + 0.406X_2 - 0.001X_3
\end{aligned}$$

If $X_1$ is considered the dependent variable, it can be estimated
from the first equation; if $X_2$ is considered the dependent variable,
it can be estimated from the second equation; if $X_3$ is considered
the dependent variable, it can be estimated from the third
equation; if $X_4$ is considered the dependent variable, it can be
estimated from the fourth equation. The standard errors of
estimate, that is, the scatters, respectively, about the four
planes of regression may also be calculated.¹

$$\begin{aligned}
\sigma^2_{1.234} &= \sigma^2_{1.23}(1 - r^2_{14.23}) = 353.34(0.98204) = 346.9940, & \sigma_{1.234} &= 18.628 \\
\sigma^2_{2.134} &= \sigma^2_{2.13}(1 - r^2_{24.13}) = 438.53(0.98460) = 431.7766, & \sigma_{2.134} &= 20.779 \\
\sigma^2_{3.124} &= \sigma^2_{3.12}(1 - r^2_{34.12}) = 5{,}325.56(1.00000) = 5{,}325.56, & \sigma_{3.124} &= 72.976 \\
\sigma^2_{4.123} &= \sigma^2_{4.12}(1 - r^2_{43.12}) = 4{,}622.8210(1.00000) = 4{,}622.8210, & \sigma_{4.123} &= 67.991
\end{aligned}$$

¹ For alternative methods, see p. 416 and Eq. (19), Chap. XVI.

The multiple-correlation coefficients, which measure the goodness
of fit of the planes of regression, are calculated in the same
way as for the trivariate problem, namely,¹

$$R^2_{i.jkn} = 1 - \frac{\sigma^2_{i.jkn}}{\sigma^2_i}$$

$$\begin{aligned}
R^2_{1.234} &= 1 - \frac{346.994}{1{,}930.3155} = 1 - 0.17976 = 0.8202, & R_{1.234} &= 0.9056 \\
R^2_{2.134} &= 1 - \frac{431.7766}{2{,}236.4883} = 1 - 0.19306 = 0.8069, & R_{2.134} &= 0.8983 \\
R^2_{3.124} &= 1 - \frac{5{,}325.56}{8{,}879.2866} = 1 - 0.5998 = 0.4002, & R_{3.124} &= 0.6326 \\
R^2_{4.123} &= 1 - \frac{4{,}622.8210}{6{,}187.6788} = 1 - 0.74710 = 0.2529, & R_{4.123} &= 0.5029
\end{aligned}$$
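The same calculation can be sketched in Python for readers who wish to verify the figures. The function name `multiple_r` is hypothetical; the variances are those quoted in the text for $X_1$.

```python
import math


def multiple_r(residual_variance, total_variance):
    """Return (R squared, R), where R squared = 1 - residual variance / total variance."""
    r_squared = 1.0 - residual_variance / total_variance
    return r_squared, math.sqrt(r_squared)


# Scatter about the plane for X1 (346.994) against the total variance of X1
r_squared, r = multiple_r(346.994, 1930.3155)
print(round(r_squared, 4), round(r, 4))
```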


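For the four-variable problem, the equation for the β squares
and β cross products is as follows:

$$\beta^2_{ij.kn} + \beta^2_{ik.jn} + \beta^2_{in.jk} + 2r_{jk}\beta_{ij.kn}\beta_{ik.jn} + 2r_{jn}\beta_{ij.kn}\beta_{in.jk} + 2r_{kn}\beta_{ik.jn}\beta_{in.jk} + \frac{\sigma^2_{i.jkn}}{\sigma^2_i} = 1$$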

In Table 62 some of these checks are illustrated. 

¹ For an alternative method, see Eq. (20), Chap. XVI.

Interpretation of Results Illustrated. The interpretation of
the above statistics of a four-variable frequency distribution
may be illustrated by assuming that it is desired to predict the
second-semester English grades of freshmen at the woman's
college selected; in other words, $X_1$ is assumed to be the
dependent variable. From the equation for the plane of regression
of $X_1$ on $X_2$, $X_3$, and $X_4$, namely,

$$X'_1 = 14.64 + 0.721X_2 + 0.069X_3 + 0.037X_4$$

estimates may be made of a freshman's grade in second-semester
English if her grades in the verbal scholastic-aptitude test, in
College Board English, and in the first-semester freshman
English course are known.

Estimates Based on Regression Equation. If a freshman's
grade in first-semester English is 300, in the verbal scholastic-aptitude
test 600, and in College Board English 500, her second-semester
English grade is estimated as follows:

$$\begin{aligned}
X'_1 &= 14.64 + 0.721(300) + 0.069(600) + 0.037(500) \\
     &= 14.64 + 216.3 + 41.4 + 18.5 \\
     &= 291
\end{aligned}$$
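The estimate may be reproduced with a short function. This is a sketch only: the name `predict_second_semester` and its parameter names are hypothetical labels for $X_2$, $X_3$, and $X_4$, and the coefficients are the rounded ones of the regression equation above.

```python
def predict_second_semester(first_semester, verbal_aptitude, college_board):
    """Estimate X1 from the rounded regression plane of X1 on X2, X3, X4."""
    return (14.64
            + 0.721 * first_semester     # X2: first-semester English
            + 0.069 * verbal_aptitude    # X3: verbal scholastic-aptitude test
            + 0.037 * college_board)     # X4: College Board English


estimate = predict_second_semester(300, 600, 500)
print(round(estimate))
```

The unrounded sum is 290.84, which the text gives to the nearest whole grade.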

Since the second-semester English grade will, of course, be
affected by other factors, the student's actual grade in
second-semester English will deviate from estimates based upon the
regression equation. This raises the question as to how much
on the average it can be expected that estimates based on the
regression equation will deviate from the actual values. The
answer is found by the determination of the value of $\sigma_{1.234}$,
which has been found above to be 18.6, or approximately 19.
The standard deviation of the differences between the actual
and the estimated grades in second-semester English is therefore
about 19. If this regression equation and second-order standard
deviation are typical of these college grades and if the differences
between actual and estimated values are in general normally
distributed, the chances are about 95 in 100 that the actual value in a
particular case will be within limits ±38 ($= 2\sigma_{1.234}$) from the
estimated value.
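Under the normality assumption just stated, limits of two standard errors on either side of the estimate cover roughly 95 per cent of cases. A small sketch of the limits and of that coverage follows; the variable names are illustrative, and the coverage is taken from the normal curve by way of the standard `math.erf` function.

```python
import math

estimate = 291.0      # estimated second-semester grade from the regression equation
sigma = 18.628        # standard error of estimate, sigma(1.234)

lower = estimate - 2 * sigma
upper = estimate + 2 * sigma

# Proportion of a normal distribution lying within two standard deviations of its mean
coverage = math.erf(2 / math.sqrt(2))

print(round(lower), round(upper), round(coverage, 3))
```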

The foregoing conclusion, which is based on the value of
$\sigma_{1.234}$, can be summarized very succinctly by the calculation of
$R_{1.234}$, which has been found to be equal to 0.9056.

If this result is compared with the estimate based on only two
independent variables, it is found that the standard error of
estimate is almost as large for the plane based on three independent
variables as the standard error of estimate based on two
independent variables.¹ In other words, very little increase in
accuracy was obtained by including the fourth variable in
the correlation problem. This same conclusion is borne out
by comparing the coefficients of multiple correlation. Thus
$R_{1.234} = 0.9056$, while $R_{1.23} = 0.9039$, which is nearly as large,
indicating that the trivariate plane was nearly as good a fit as
the four-variable plane of regression.

Partial-correlation Coefficients. The unimportance of knowledge
of grades in College Board English examinations in predicting
the grades of freshmen in second-semester English is
explained also by the small partial-correlation coefficient between
$X_1$ and $X_4$ when $X_2$ and $X_3$ are held constant. This
partial-correlation coefficient is given as $r_{14.23} = 0.1340$.

Analysis of Variance in $X_1$. These conclusions are further
indicated by the nature of the β squares and the β cross-product
terms. From the first equation in Table 62 it is seen that the
various proportions of variance in $X_1$ are accounted for as
follows:

60.25 per cent by correlation with first-semester English grades.
 2.17 per cent by correlation with verbal scholastic-aptitude
      tests.
 0.43 per cent by correlation with College Board English
      examinations.
13.58 per cent by indirect correlation with first-semester English
      grades and verbal scholastic-aptitude tests.
 4.98 per cent by indirect correlation with first-semester English
      grades and College Board English examinations.
 0.61 per cent by indirect correlation with verbal scholastic-aptitude
      tests and College Board English examinations.
17.98 per cent by correlation with other factors independent of
      first-semester English grades, verbal scholastic-aptitude
      test grades, and College Board English examinations.

¹ Cf. p. 461.

The small percentages attributable to College Board English
examination grades, either directly or indirectly, are apparent
from these statistics. Evidently, under conditions existing at
the woman's college, grades on the College Board English
examination were of little value for predicting how well the
students would do in their college freshman English courses.¹
Another approach to the study of variance in $X_1$ could be
made as follows: It was noted above for three variables that²

$$\sigma^2_1 = r^2_{12}\sigma^2_1 + r^2_{13.2}\sigma^2_{1.2} + \sigma^2_{1.23}$$

For four variables,

$$\sigma^2_1 = r^2_{12}\sigma^2_1 + r^2_{13.2}\sigma^2_{1.2} + r^2_{14.23}\sigma^2_{1.23} + \sigma^2_{1.234}$$

which may be expressed in proportions as follows:

$$1 = r^2_{12} + r^2_{13.2}\frac{\sigma^2_{1.2}}{\sigma^2_1} + r^2_{14.23}\frac{\sigma^2_{1.23}}{\sigma^2_1} + \frac{\sigma^2_{1.234}}{\sigma^2_1}$$

This expression means that the total variance in $X_1$ is composed
of four parts as follows: the part that is due to total simple
linear correlation with $X_2$, the part that is due to partial
correlation with $X_3$ when $X_2$ is held constant, the part that is due to
partial correlation with $X_4$ when $X_2$ and $X_3$ are held constant,
and the part due to other causes independent of $X_2$, $X_3$, and $X_4$.

The expression $r^2_{13.2}\,\sigma^2_{1.2}/\sigma^2_1$ describes the proportion of the variance
in $X_1$ that is explained as a result of adding $X_3$ to the regression
equation, while $r^2_{14.23}\,\sigma^2_{1.23}/\sigma^2_1$ describes the proportion of the variance
in $X_1$ that is explained as a result of adding $X_4$ to the regression
equation; the influences of $X_3$ and $X_4$ that result from their
association with $X_2$ are already contained in $r^2_{12}\sigma^2_1$. By substituting
the values of the four above terms in the illustrated
problem, it becomes

$$1.00000 = 0.80238 + 0.07369(1 - 0.80238) + 0.017957\left(\frac{353.34}{1{,}930.3155}\right) + \frac{346.9940}{1{,}930.3155}$$

or

$$1.00000 = 0.80238 + 0.01456 + 0.00329 + 0.17977$$

¹ It will be noted, however, that $r_{14} = 0.49$ so that approximately 24 per
cent [$= (0.49)^2$] of the variation in $X_1$ may be estimated from knowledge of $X_4$
alone.

² Cf. Chap. XVI, Eq. (33), p. 428.

From this expression it may be said that 80.2 per cent of the
variance in $X_1$ is accounted for by total correlation with $X_2$, a
further 1.4 per cent is accounted for by additional correlation
with $X_3$, and a further 0.3 per cent is accounted for by additional
correlation with $X_4$, the remaining 18 per cent being due to other
influences independent of $X_2$, $X_3$, and $X_4$. In other words, by
making a four- instead of a three-variable correlation problem,
that is, by including the College Board English examination
grades, only an additional 0.3 per cent of the variance in
second-semester English grades is explained.
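The successive shares of variance can be checked numerically. The sketch below uses the figures quoted in this chapter; the variable names are illustrative, and the ratio of $\sigma^2_{1.2}$ to $\sigma^2_1$ is taken as $1 - r^2_{12}$, which is the usual identity for the variance left after one variable is admitted.

```python
r12_sq = 0.80238          # simple r squared between X1 and X2
r13_2_sq = 0.07369        # partial r squared of X1 and X3, X2 constant
r14_23_sq = 0.017957      # partial r squared of X1 and X4, X2 and X3 constant
var_1 = 1930.3155         # total variance of X1
var_1_23 = 353.34         # variance about the three-variable plane
var_1_234 = 346.9940      # variance about the four-variable plane

share_x2 = r12_sq
share_x3 = r13_2_sq * (1 - r12_sq)           # added by bringing in X3
share_x4 = r14_23_sq * var_1_23 / var_1      # added by bringing in X4
share_other = var_1_234 / var_1              # left unexplained

total = share_x2 + share_x3 + share_x4 + share_other
print(round(share_x3, 5), round(share_x4, 5), round(share_other, 5), round(total, 4))
```

The four shares sum to unity within the rounding of the published figures.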



CHAPTER XVIII 
NORMAL FREQUENCY SURFACE 
THE BIVARIATE HISTOGRAM 

The study of frequency surfaces begins logically with a geo- 
metrical representation of a bivariate frequency distribution 
known as a "bivariate histogram." To visualize the histogram
that would represent the distribution of Table 25 (page 326), 
consider an ordinary checkerboard. Let the side and top of 
the board be calibrated with the class-interval scale shown in 
Table 25, and let 81 checkers be taken to represent the 81 
students. On the checkerboard square in the row headed 60- 
and the column headed 120-, let one checker be placed; on the 
square in the row headed 100- and the column headed 60-, let 
two checkers be placed; on the square in the row headed 100- and 
the column headed 100-, let one checker be placed; and so on, 
until all the squares on the checkerboard for which there are 
frequencies in Table 25 are covered with the proper number of 
checkers piled on top of each other. 

If the checkers were square rather than round, they would 
stand up better and fill in all the area, helping to support each 
other. If they were square, the resulting figure would resemble 
a histogram for the given bivariate frequency distribution. A 
picture of what such a histogram would look like is given in
Fig. 121. 

In the foregoing example the heights of the various piles of 
checkers represented the frequency of each cell. It would be 
possible however, so to adjust the vertical scale that the heights 
of the piles of checkers represented the relative frequency of 
each cell. If the checkers were square, giving a histogram 
proper, then, further, it would be possible to adjust the vertical 
scale so that the volume of each pile of square checkers measured 
the relative frequencies. For example, since the class intervals
are 20 units each and the area of any cell is thus 400 square
units, the height of a pile of checkers taken to measure a relative
frequency of, say, 0.08 would be 0.0002 unit. This would,
of course, be very small; but then, in any model, the vertical
unit could be taken sufficiently large to offset this. That is,
instead of letting 1/4 inch represent 1 unit (the thickness of one
checker, say), it would be possible to let 10,000 inches represent
1 unit. Then 0.0002 unit would be the equivalent of a pile of
eight checkers.
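The scaling rule in this paragraph (height equals relative frequency divided by the area of the cell, so that volume equals relative frequency) may be sketched as follows. The function name `block_height` is hypothetical.

```python
def block_height(relative_frequency, width_1, width_2):
    """Height of a histogram block whose volume equals the cell's relative frequency."""
    return relative_frequency / (width_1 * width_2)


height = block_height(0.08, 20, 20)   # class intervals of 20 units on each scale
volume = height * 20 * 20             # recovers the relative frequency
print(height, volume)
```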



Fig. 121. — Histogram representation of a bivariate frequency distribution. 
Rectangular blocks on the other side of the mean point are presumably obscured 
from view. 

Suppose, now, that a histogram is constructed so that volumes 
of the square checkers erected on each cell represent the relative 
frequency of that cell, and suppose that the number of cases is 
indefinitely increased and at the same time the size of the class 
intervals is made infinitesimally small. The result would be a 
solid figure the top of which would tend to trace out a smooth
surface. This would be a frequency surface. A frequency surface
is thus the limit approached by a bivariate histogram as the
number of cases is indefinitely increased and the sizes of the class
intervals indefinitely reduced. If an area is traced out in the
$X_1X_2$ plane, the relative frequency of cases falling in this area is
given by the volume under the surface over that area.

FREQUENCY SURFACES 

Frequency surfaces may assume all sorts of shapes. They may 
be symmetrical and bell-shaped, or they may be distorted by 
skewness or excessive peakedness or flatness, depending on the 
types of forces underlying the variation in the two variables. 
First will be considered the case of a bivariate surface for variables 
that are normally distributed and are independent of each other.

Bivariate-surface, Independent Variables. A monovariate 
frequency distribution, it will be recalled, showed the relative 
frequency of occurrence of various values of a given variable. A 
joint, or bivariate, frequency distribution shows the relative 
frequency of occurrence of various pairs of values of the two 
given variables. Suppose, for example, that a marksman is 
shooting at a target. The scatter of dots about the center of 
Fig. 122 may be taken to illustrate the results of a large number 
of such shots. The position of any particular shot relative to
the center of the target may be indicated by the amount of its
horizontal deflection (call it $x_2$) and by the amount of its vertical
deflection (call it $x_1$). The relative frequencies of various types
of shots may consequently be indicated by the relative frequencies
of various combinations of horizontal and vertical
deflections, that is to say, of various pairs of values of $x_1$ and $x_2$.

The relative frequency of shots in any given area of the target, 
the $x_1x_2$ plane shown in Fig. 122, may be indicated by the density
of shots in that area or by the volume of some frequency surface
constructed over the $x_1x_2$ plane. The use of the surface for this
purpose is illustrated in Fig. 123. 

It will be noted that the shots tend to be distributed sym- 
metrically around the center of the target. No tendency for 
large vertical deviations to be associated with large horizontal 
deviations in either a positive or a negative direction is evident.
Also, no tendency for vertical deviations to vary in any particular
way with horizontal deviations is apparent.

Fig. 122. Distribution of shots at a target, representing a symmetrical bivariate
distribution.

Fig. 123. A normal bivariate frequency surface, independent variables. [Here
$\sigma_1 = \sigma_2$.]

Normal Bivariate-surface, Independent Variables. Figure 123
illustrates the normal bivariate surface for independent variables; it
and the example described above illustrate the characteristics
of a bivariate distribution where there is no correlation between
the two variables. This may be summarized as follows: There
is no correlation, that is to say, the variables are independent of
each other, because (1) for any given value of $X_1$, the distribution
of values of $X_2$ is the same, with the same mean and standard
deviation, as for any other value of $X_1$; (2) for any given value
of $X_2$, the distribution of values of $X_1$ is the same, with the same
mean and standard deviation, as for any other value of $X_2$.
When each variable is the result of a set of forces that will produce 
a normal frequency distribution in that variable alone and when 
the two sets of forces operate independently of each other, the 
result will be a normal bivariate frequency distribution with no 
correlation. The easiest way of generating a normal bivariate
frequency surface is to suppose that a form of the normal frequency
curve is held in a position perpendicular to the base plane, as in
Fig. 124. A knob is fixed to the top of the frequency curve at
B, and the center of the base line of the frequency curve is
fixed at A, so that it can revolve but so that the line BA always
remains perpendicular to the base plane CD.

If the form of this normal frequency curve is revolved in a
complete cycle until it reaches its original position again, the
frequency curve will describe the surface of the bivariate
normal frequency surface for independent variables, and it will
be like a system of symmetrically concentric circles such as
that shown in Fig. 123. For such a distribution of pairs of
observations $X_1$ and $X_2$, $r = 0$, for the $x_1x_2$ products are
distributed equally in the four quadrants, minus products canceling
plus products.

Fig. 124. The normal curve, revolution of which will produce

Mathematical Representation of Normal Bivariate-surface,
Independent Variables. Use of $\bar{X}_1$ and $\bar{X}_2$ as the Origin. As
noted in the discussion of Fig. 122, the various $X_1X_2$ points
plotted in a bivariate plane may, with no difficulty, be described
in terms of their distances from the respective means. This has
the effect of shifting the axes so that the new axes are the lines
drawn perpendicular to the means of the respective scales; the
vertical line drawn through $\bar{X}_2$ in Fig. 122 is the $x_1$-axis, and the
horizontal line drawn through $\bar{X}_1$ in Fig. 122 is the $x_2$-axis.
Vertical and horizontal deviations from the center of the circle
are $x_1$ and $x_2$ variates. For many purposes it is more convenient
to use this method of describing points in a bivariate plane than
to use the original scales as the point of reference. In the following
pages, the more frequent appearance of $x_1$ and $x_2$, instead of
the capital letters, will be understood to signify the shift from
reference to the original axes to reference to the axes with the
origin at the means of the two variables.

Probability of Each Variate Taken Separately. If $x_1$ is a normally
distributed variate above and below the $\bar{X}_1$ and is completely
independent of $x_2$, the probability or relative frequency of
any value of $x_1$ between $x_1$ and $x_1 + dx_1$, whether associated
with large or with small values or with positive or negative
values of $x_2$, will be given by

$$dP(x_1) = \frac{1}{\sigma_1\sqrt{2\pi}}\,e^{-x_1^2/2\sigma_1^2}\,dx_1 \tag{1}$$

Similarly, if $x_2$ is a normally distributed variate and is completely
independent of $x_1$, the probability or relative frequency of any
value of $x_2$ between $x_2$ and $x_2 + dx_2$, whether associated with
large or with small values of $x_1$ or with positive or negative
values of $x_1$, will be given by

$$dP(x_2) = \frac{1}{\sigma_2\sqrt{2\pi}}\,e^{-x_2^2/2\sigma_2^2}\,dx_2 \tag{2}$$


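Joint Probability of Two Variables. The joint probability or
joint relative frequency of an $x_1$ between $x_1$ and $x_1 + dx_1$ occurring
in association with an $x_2$ between $x_2$ and $x_2 + dx_2$ is the
product of the above two probabilities. In other words, the
joint probability of the two variables occurring in pairs of any
combination is given by

$$dP(x_1x_2) = \frac{1}{\sigma_1\sqrt{2\pi}}\,e^{-x_1^2/2\sigma_1^2}\,dx_1 \cdot \frac{1}{\sigma_2\sqrt{2\pi}}\,e^{-x_2^2/2\sigma_2^2}\,dx_2 \tag{3}$$

which reduces to the following form:

$$dP(x_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2}\,e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)}\,dx_1\,dx_2 \tag{4}$$

Fig. 125. A normal bivariate frequency surface, independent variables. [Here
$\sigma_1 > \sigma_2$.]

Geometrically, the $dP(x_1x_2)$ expressed in Eq. (4) describes the
volume of a column with breadth and width of $dx_1$ and $dx_2$ and a
height equal to $\frac{1}{2\pi\sigma_1\sigma_2}\,e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)}$. Such a column is shown at
P in Fig. 125.

The normal bivariate surface may be described, therefore,
as follows:

$$z = \frac{1}{2\pi\sigma_1\sigma_2}\,e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)} \tag{5}$$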
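The statement that the joint ordinate is the product of the two separate normal ordinates can be verified numerically. In the sketch below the function names and the trial values of $\sigma_1$ and $\sigma_2$ are illustrative assumptions, not figures from the text.

```python
import math


def normal_ordinate(x, sigma):
    """Ordinate of a normal curve with mean zero and standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))


def joint_ordinate(x1, x2, sigma1, sigma2):
    """Height of the normal bivariate surface for independent variables."""
    exponent = -0.5 * ((x1 / sigma1) ** 2 + (x2 / sigma2) ** 2)
    return math.exp(exponent) / (2 * math.pi * sigma1 * sigma2)


# Illustrative deviations and standard deviations (assumed values)
x1, x2, sigma1, sigma2 = 1.5, -0.7, 3.0, 1.2
product = normal_ordinate(x1, sigma1) * normal_ordinate(x2, sigma2)
joint = joint_ordinate(x1, x2, sigma1, sigma2)
print(abs(product - joint) < 1e-12)
```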



If the two standard deviations are equal, the normal probability
surface is circular like Fig. 123. Horizontal planes
parallel to the base will intersect the figure in the form of circles
becoming smaller as the plane is elevated. Any vertical plane
parallel with the $x_1$-axis (a line through the $\bar{X}_2$) will intersect the
figure in the form of a normal curve with a standard deviation
equal to $\sigma_1$; and any vertical plane parallel with the $x_2$-axis will
intersect the figure in the form of a normal curve with a standard
deviation equal to $\sigma_2$. If the two standard deviations are equal,
these normal curves will be identical.

If, however, the two standard deviations are not equal, the
normal bivariate surface will be elliptical in form, as shown in
Fig. 125, rather than circular. Vertical planes drawn as before
will nevertheless intersect the surface in contours of normal
curves. The vertical normal curves will have standard
deviations equal to $\sigma_1$, and the horizontal normal curves will
have standard deviations equal to $\sigma_2$. Horizontal planes parallel
to the base in Fig. 125 will intersect the figure in the form of
ellipses, which will become smaller as the plane is elevated.
Figure 126 is the sort of ellipse that would be obtained by the
intersection of a plane horizontal to the base plane of Fig. 125.
The equation for the ellipse shown in Fig. 126 is

$$0.3x_1^2 + 6.7x_2^2 - 32 = 0$$

or

$$x_1 = \pm\sqrt{\frac{32 - 6.7x_2^2}{0.3}}$$

Pairs of $x_1$ and $x_2$ that satisfy this equation are:

   x2        x1
   0        ±10.3
  ±0.5      ±10.0
  ±1.0      ± 9.2
  ±1.5      ± 7.5
  ±2.0      ± 4.2
  ±2.18       0
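Points on this ellipse can be generated directly from the solved form of its equation, $0.3x_1^2 + 6.7x_2^2 - 32 = 0$. The sketch below is illustrative only; the function name `x1_on_ellipse` is hypothetical, and the computed values agree with the tabled pairs only approximately, since the table is rounded to one decimal place.

```python
import math


def x1_on_ellipse(x2):
    """Positive x1 satisfying 0.3*x1**2 + 6.7*x2**2 - 32 = 0 for a given x2."""
    return math.sqrt((32 - 6.7 * x2 ** 2) / 0.3)


for x2 in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(x2, round(x1_on_ellipse(x2), 2))

# Largest admissible x2, where the ellipse crosses the x2-axis (x1 = 0)
x2_limit = math.sqrt(32 / 6.7)
print(round(x2_limit, 2))
```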



Fig. 126. A horizontal section of
the frequency surface of Fig. 125.

Bivariate-surface, Correlated Variables. Instead of two independent
variables, suppose there is a set of paired variables in
which is displayed a marked tendency for positive correlation, so
that large values of $X_1$ are associated with large values of $X_2$,
and vice versa. This is the same as to say that positive values
of $x_1$ occur predominantly with positive values of $x_2$ and negative
values of $x_1$ occur predominantly with negative values of $x_2$, the
small x's measuring in each case the deviations from respective
means. Assume that each distribution taken separately is a
symmetrical one like a in Fig. 127 and a in Fig. 128. In Fig.
127 let a represent the frequency curve of the total distribution
of the $X_2$ variable. Then suppose this frequency distribution of
all the variates of the variable $X_2$ is cross-classified into three
groups, (1) those associated with large values of $X_1$, (2)
those associated with the ordinary or average range of values of
$X_1$, and (3) those associated with small values of $X_1$.



Fig. 128. — Vertical view of correlated variables. 


The plane is accordingly divided vertically into three parts
representing the range of (1) large values of $X_1$ (this part of the
plane is labeled β in Fig. 127); (2) ordinary or average range of
$X_1$ values, represented in the figure by γ; and (3) small values of
$X_1$, represented by δ in the figure.

By summarizing in one group those variates of $X_2$ associated with
large values of $X_1$ (those in the range of β in Fig. 127), and under
the assumption that large values of $X_2$ are associated with large
values of $X_1$, a frequency distribution like b, whose mean would
be larger than the mean of the total population of variable $X_2$,
would be obtained. The line AA' intersects the base of the
frequency curve b at its mean point.

By summarizing in a group the $X_2$ variables in the γ range of
Fig. 127, a frequency distribution of $X_2$ variables like c would be
obtained; then the one showing the $X_2$ variables associated with
$X_1$ in the range of δ would give a frequency distribution like d.
The line AA' in Fig. 127 also passes through the mean of the
frequency curve d. In other words, the means of curves b, c, and
d all lie on the same straight line, AA'.

The $X_1$ variable is treated in a similar manner in Fig. 128, in
which a represents the frequency curve of all of the values of the
$X_1$ variable. This frequency distribution of all the $X_1$ variables
is then cross-classified into three groups, (1) those associated
with small values of $X_2$, (2) those associated with ordinary or
average range of values of $X_2$, and (3) those associated with large
values of $X_2$. The plane of Fig. 128 is accordingly divided
horizontally into three parts, representing the range of (1) small
values of $X_2$ (this part of the plane is labeled β in Fig. 128); (2)
ordinary or average range of $X_2$ values, represented in the figure
by γ; and (3) large values of $X_2$, represented by δ in the figure.
By summarizing in one frequency distribution the variates of $X_1$
associated with small values of $X_2$ (those in the range of β in
Fig. 128), under the assumption that small values of $X_1$ are
associated with small values of $X_2$, a frequency distribution like
b, whose mean is smaller than the mean of the total population
of variable $X_1$, would be obtained.

By summarizing in one group the $X_1$ variables in the range γ
of the $X_2$ variable, a frequency distribution of $X_1$ variables like
c would be obtained; the group of $X_1$ variables associated with
$X_2$ in the range of δ will give a frequency distribution like d.
The line passing through the means of these three frequency
distributions would be like BB' in Fig. 128.

Normal Correlation Surface, Correlated Variables. A bivariate 
frequency distribution showing the joint variation of two 
correlated variables would thus appear to be represented by a 
frequency surface that is turned so as to make an angle with the 
X1- and X2-axes. A picture of a normal bivariate frequency 
surface for correlated variables is shown in Fig. 129. Figures 127 
and 128 constitute analyses of the frequencies of Fig. 129 that 
divided the surface into three parts, first up and down and second 
left and right. The three figures, therefore, are an attempt to 
view the same distribution in three different ways. If any cross 
section is taken of the surface represented by Fig. 129, parallel to 
the X1-axis, the cross section will have the form of a normal 
frequency curve with its mean on the line bb'. Any cross section 
of this surface taken parallel to the X2-axis will have the form of 
a normal frequency curve with its mean on the line aa'. Such 
cross sections are similar in character to the frequency curves 
b, c, and d, discussed in connection with Figs. 127 and 128, 
respectively. Typical cross sections are likewise shown in 
Fig. 129. 

Fig. 129. — A normal bivariate frequency surface, correlated variables. 

Careful study of Figs. 127 to 129 will aid greatly in the 
understanding of the theory of correlation. They serve also as the 
basis for comprehending the theoretical explanation in the ensuing 
section. 

Derivation of Equation for Bivariate Normal Frequency 
Distribution, Correlated Variables. Equation of a Rotated Ellipse. 
A quadratic equation of the general form 

    a X_1^2 + 2h X_1 X_2 + b X_2^2 + 2g X_1 + 2f X_2 + c = 0    (6) 

is an ellipse under the following conditions:¹ 

    ab - h^2 > 0  and  D \neq 0 

where 

    D = \begin{vmatrix} a & h & g \\ h & b & f \\ g & f & c \end{vmatrix} 
      = abc + hgf + gfh - bg^2 - ch^2 - af^2 

For example, the equation 

    X_1^2 - 4X_1X_2 + 6X_2^2 - 24X_1 + 64X_2 + 144 = 0    (6') 

is an ellipse like that shown in Fig. 130, expressed with 
reference to the large X1X2-axes. The center of the ellipse is at 
X1 = 4, X2 = -4.² The equation for an ellipse with reference to 
the axes passing through its center is³ 

    a'x_1^2 + 2h'x_1x_2 + b'x_2^2 + c' = 0    (7) 

where a' = a, h' = h, b' = b, and c' = D/(ab - h^2). 

Fig. 130. — A horizontal cross section of a normal bivariate 
surface, correlated variables. 

For Eq. (6') the new form is 

    x_1^2 - 4x_1x_2 + 6x_2^2 - 32 = 0    (7') 
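
The arithmetic behind Eqs. (6') and (7') can be checked mechanically. The following is a minimal Python sketch, not part of the original text; the function name conic_invariants is hypothetical, and the center is obtained by Cramer's rule from the two linear equations aX1 + hX2 + g = 0 and hX1 + bX2 + f = 0. 

```python
def conic_invariants(a, h, b, g, f, c):
    """Quantities for a*X1^2 + 2h*X1*X2 + b*X2^2 + 2g*X1 + 2f*X2 + c = 0.

    Returns (ab - h^2, D, center, c'), where c' = D / (ab - h^2).
    Assumes ab - h^2 is not zero.
    """
    delta = a * b - h * h
    # D is the determinant |a h g; h b f; g f c|, expanded term by term.
    big_d = a * b * c + 2 * f * g * h - a * f * f - b * g * g - c * h * h
    # The center solves a*X1 + h*X2 + g = 0 and h*X1 + b*X2 + f = 0.
    center = ((h * f - b * g) / delta, (h * g - a * f) / delta)
    return delta, big_d, center, big_d / delta


# Coefficients of Eq. (6'): a = 1, h = -2, b = 6, g = -12, f = 32, c = 144.
delta, big_d, center, c_prime = conic_invariants(1, -2, 6, -12, 32, 144)
print(delta, big_d, center, c_prime)
```

With the coefficients of Eq. (6') the sketch gives ab - h^2 = 2 and D = -64, so both ellipse conditions hold, and it reproduces the center (4, -4) and the constant c' = -32 of Eq. (7'). 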

¹ Fine, H. B., and H. D. Thompson, Coordinate Geometry, pp. 137-138. 

² The center of the ellipse is found by solving the following two 
equations for X1 and X2: 

    aX_1 + hX_2 + g = 0 
    hX_1 + bX_2 + f = 0 

In this problem, a = 1, h = -2, g = -12, b = 6, and f = 32. 

³ Cf. Fine and Thompson, op. cit. 

Following are the solutions for Eqs. (6') and (7'), from which 
Fig. 130 was drawn, the two equations describing the same 
ellipse: 


Equation (6'), solution:  X_1 = 2X_2 + 12 \pm \sqrt{-2(X_2^2 + 8X_2)} 

    X2      2X2 + 12 ± root      X1 
     0      12 ± 0               12 
    -1      10 ± √14             13.74,  6.3 
    -2       8 ± √24             12.9,   3.1 
    -3       6 ± √30             11.5,   0.5 
    -4       4 ± √32              9.7,  -1.7 
    -5       2 ± √30              7.5,  -3.5 

Equation (7'), solution:  x_1 = 2x_2 \pm \sqrt{32 - 2x_2^2} 

    x2      2x2 ± root           x1 
     0       0 ± √32             5.7, -5.7 
    ±1      ±2 ± √30             ±7.5, ∓3.5 
    ±2      ±4 ± √24             ±8.9, ∓0.9 
    ±3      ±6 ± √14             ±9.7, ±2.3 
    ±3.5    ±7 ± √7.5            ±9.7, ±4.3 
    ±4      ±8 ± √0              ±8 

The equations of the axes of the ellipse are obtained by finding 
the roots of λ in the following equation: 

    h'\lambda^2 + (a' - b')\lambda - h' = 0 

or, in this case, 

    -2\lambda^2 - 5\lambda + 2 = 0 

    \lambda = 0.351  or  \lambda = -2.85 

The positive root, 0.351 = 1/2.85, is the slope of the major axis. 
The equation for the major axis of the ellipse is therefore 
x2 = 0.351x1, that is, x1 = 2.85x2, and the equation for the 
minor axis is x2 = -2.85x1. 

Referred to its own major and minor axes, the equation of the 
ellipse is Ax_1^2 + Bx_2^2 + C = 0, where A and B are obtained from 

    A + B = a' + b'      AB = a'b' - h'^2      C = c' 

and the condition that A - B has the same sign as h'. For this 
ellipse it is thus found that A = 0.3 and B = 6.7. The equation 
for this ellipse referred to its own axes (see Fig. 125) is 

    0.3x_1^2 + 6.7x_2^2 - 32 = 0 
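
The reduction to principal axes can be verified numerically. The following is a minimal Python sketch, not taken from the original; the function names are hypothetical, and h is assumed to be nonzero. It solves A + B = a' + b' and AB = a'b' - h'^2 subject to the sign condition on A - B, and it finds the two axis slopes as roots of h'λ^2 + (a' - b')λ - h' = 0, pairing each slope with the coefficient a' + h'λ of the quadratic form along that direction. 

```python
import math


def principal_coefficients(a, h, b):
    """A and B with A + B = a + b, A*B = a*b - h^2, and A - B signed like h."""
    half_sum = (a + b) / 2
    half_gap = math.sqrt(((a - b) / 2) ** 2 + h * h)
    if h < 0:
        return half_sum - half_gap, half_sum + half_gap
    return half_sum + half_gap, half_sum - half_gap


def axis_slopes(a, h, b):
    """Roots of h*l^2 + (a - b)*l - h = 0, each paired with a + h*l.

    The slope paired with the smaller coefficient is the major axis.
    Assumes h is not zero.
    """
    disc = math.sqrt((a - b) ** 2 + 4 * h * h)
    roots = [(-(a - b) + disc) / (2 * h), (-(a - b) - disc) / (2 * h)]
    return [(lam, a + h * lam) for lam in roots]


A, B = principal_coefficients(1, -2, 6)
print(round(A, 1), round(B, 1))
for slope, coefficient in axis_slopes(1, -2, 6):
    print(round(slope, 3), round(coefficient, 3))
```

For the ellipse of Eq. (7') the slope paired with the smaller coefficient is about 0.351, the reciprocal of 2.85, which corresponds to the major axis x1 = 2.85x2. 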

Mathematical Representation of a Bivariate Normal Correlation 
Surface. It was noted above that the bivariate normal surface 
in which x1 and x2 are independent of each other (that is, in 
which no correlation exists between them) is of the form 

    dP(x_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2} e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)} dx_1 dx_2 

The constant term, 1/(2πσ1σ2), is a constant dependent on the 
values of the two standard deviations in any particular instance. 
The product of this constant times the exponential term gives, 
for various values of x1 and x2, the height of the bivariate surface 
from the base (the distance OP in Fig. 125). If a horizontal plane 
parallel with the base plane is drawn through the normal bivariate 
surface at a distance OP from the base plane, the intersection of 
the plane and the bivariate surface will be an ellipse (as in Fig. 
126) if the standard deviations are unequal; the intersection will 
be a circle (as in Fig. 123) if the two standard deviations are equal. 
Such a plane represents the locus of all points distant OP from the 
base plane, and the passing of such a horizontal plane through 
the bivariate surface is equivalent to setting the exponential term 
equal to a constant, which is equivalent to putting 

    \frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2} = c 

This equation represents a circle if σ1 = σ2 and an ellipse if 
σ1 ≠ σ2. 

The smaller the constant c, the smaller will be the circle or 
ellipse until, at the peak of the bivariate surface, a very small 
circle or ellipse will be found, and finally just a point. 

If the two variables are correlated, two changes occur. (1) 
The ellipse is rotated. (2) The ellipse is narrowed. If before 
correlation the surface is circular in form, owing to the fact that 
the standard deviations are equal, the existence of correlation 
will cause the circle to be converted into a rotated ellipse, narrow- 
ing the circle to an elliptical form. If before correlation the 
surface is elliptical in form, owing to the fact that the standard 
deviations are unequal (see Fig. 126), the existence of correlation 
will cause the ellipse to rotate and also to become narrower. 
This phenomenon is explained as follows: 

If larger than average values of X1 cause X2 to be larger than 
average and smaller than average values of X1 cause X2 to be 
smaller than average, the pull exerted on X2 values is indicated 
by the arrows in Fig. 131. The larger the X1, the more pull 
will be exercised upon X2 to make it larger than its average. 
This is indicated by making arrow (1) longer than arrows (2), 
(3), and (4), which, respectively, represent the degree to which 
successively smaller values of X1 affect values of X2, until, by 
the time X1 becomes smaller than average (below the line X1), 
arrow (4') points to the negative pull, that is, causing X2 to be 
less than its average. 

Fig. 131. — Illustrating the difference between the nonexistence and existence of 
correlation in a normal bivariate frequency surface. 

When correlation exists, this means that bivariate frequencies 
located in quadrant II, where X1 is larger and X2 is smaller 
than average, tend to move over to quadrant I, where X2 and 
X1 are both larger than average. Bivariate frequencies already 
located in quadrant I are less affected. Similarly, bivariate 
frequencies in quadrant IV tend to move to quadrant III, where 
both X1 and X2 are smaller than average, while bivariate 
frequencies in quadrant III are less affected. The result is that 
the rotated ellipse becomes narrowed as shown in the part of 
Fig. 131 at the right. Any horizontal plane parallel to the base 
of a correlated bivariate (Fig. 129) will intersect the bivariate 
frequency surface in the form of an ellipse such as that shown in 
the right half of Fig. 131: large ellipses near the base plane, and 
smaller and smaller ellipses as the horizontal plane is raised 
higher and higher from the base. These ellipses have the 
equation 

    a x_1^2 + 2h x_1 x_2 + b x_2^2 + c = 0 

As already noted, the middle term 2hx1x2 is present in the 
equation because of the fact that the ellipse is rotated and now 
described in terms of axes other than its own, although the origin 
remains the center of the ellipse. The middle term is thus 
present because of correlation, which causes the rotation of the 
ellipse. This middle term is generally called the product term 
because it is the product of the two variables. When there is 
no correlation, this middle term disappears.¹ The narrowing of 
the ellipse, as will be seen, results in the increase in the value of 
the constant term 1/(2πσ1σ2). 

Since the normal bivariate surface in which x1 and x2 are 
correlated is thus elliptical in form but rotated and narrower than 
the elliptical surface representing uncorrelated bivariates, the 
distribution of probabilities or relative frequencies will be given 
by an expression of the form 

    dP(x_1x_2) = k e^{-\frac{1}{2}(a x_1^2 + 2h x_1 x_2 + b x_2^2)} dx_1 dx_2    (9) 

This is the general formula for a normal bivariate frequency 
distribution of correlated variables. The remainder of the 
argument, which appears in the Appendix to this chapter, shows 
how the parameters k, a, h, and b may be evaluated in terms of 
the moments of x1 and x2. When the proper values of the 
parameters are inserted, the formula is as follows:² 

    dP(x_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} e^{-\frac{1}{2(1 - r^2)}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2r x_1 x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)} dx_1 dx_2    (10) 

¹ See p. 475. 

² See Appendix, pp. 492-496. 

This probability expression describes a normal bivariate 
frequency distribution such as that graphed in Fig. 129. The 
rotated position is reflected in the fact that the exponent of e 
has a middle "product term." The fact that the surface is 
narrower than it would be if there were no correlation is reflected 
in the character of the constant term, which is larger than the 
constant term of a normal bivariate frequency surface of 
uncorrelated variables.¹ In other words, because r cannot be 
greater than 1, 

    \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} \geq \frac{1}{2\pi\sigma_1\sigma_2} 

The degree to which the constant term in the correlated surface 
is larger depends upon the value of r. If r = 0, the constant 
term becomes identical with the constant term of the 
uncorrelated surface. If r = 1, the constant term of the correlated 
surface becomes infinitely large, reflecting the fact that when 
r = 1 the surface becomes so narrowed that it is a plane, all 
points being on the line of regression. 
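
The growth of the constant term with r can be tabulated directly. The following is a minimal Python sketch, assuming only the constant 1/(2πσ1σ2√(1 - r^2)) of Eq. (10); the function name peak_height and the sample values of r are illustrative. 

```python
import math


def peak_height(sigma1, sigma2, r):
    """Constant term of Eq. (10): the surface height at x1 = x2 = 0."""
    return 1.0 / (2 * math.pi * sigma1 * sigma2 * math.sqrt(1 - r * r))


# Ratio of the correlated constant to the uncorrelated constant (r = 0).
base = peak_height(1.0, 1.0, 0.0)
for r in (0.0, 0.6, 0.9, 0.99, 0.999):
    print(r, round(peak_height(1.0, 1.0, r) / base, 2))
```

The ratio depends only on r, equals 1/√(1 - r^2), and increases without limit as r approaches 1. 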

LINES OF REGRESSION 

In the discussion of Fig. 127 it was pointed out that the line 
AA' passes through the means of frequency distributions a, b, 
and c. Similarly, in the discussion of Fig. 128, it was said that 
the line BB' passes through the means of frequency distributions 
a, b, and c. In the discussion of Fig. 129 the line aa' was said to 
pass through the means of any frequency distribution made by a 
vertical plane parallel with the x1-axis, and the line bb' was said 
to pass through the means of any frequency distribution made 
by a vertical plane parallel with the x2-axis. These two lines 
are thus the progressions of the means for the normal bivariate 
surface. As will be shown shortly, they are also the least-squares 
lines that might be fitted to the surface. In both senses, 
therefore, they are the lines of regression for the surface. 

If there is no correlation, as illustrated by Figs. 122, 123, and 
126, the two lines of regression correspond with the major and 
minor axes of the ellipse, that is, with the axes represented by 
the X1 and X2 lines of Fig. 122 or Fig. 126. By hypothesis, in 
the uncorrelated bivariate surface the mean of any frequency 
distribution made by a vertical plane parallel to the x1-axis will 
be on the X1 line, and the mean of any frequency distribution 
made by a vertical plane parallel to the x2-axis will be on the X2 
line. When the surface is rotated and narrowed, as a result of 
correlation, it is part of the hypothesis that the normal symmetry 
of the surface remains and accordingly the means remain in a 
straight line, but a straight line at an acute angle rather than 
perpendicular to the original axis. 

¹ The narrowing is due to a stretching upward of a given volume. As 
indicated, in the limiting situation (r = 1), the surface becomes a vertical 
plane stretching to an infinite height and having an infinitesimal thickness. 

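Mathematical Representation of Lines of Regression. The 
bivariate normal correlation surface in terms of probabilities 
has been found to be described as follows: 

    dP(x_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} e^{-\frac{1}{2(1 - r^2)}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2r x_1 x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)} dx_1 dx_2    (11) 

A line of regression, for example, the line of regression of x2 on 
x1, is a general description of the law of relationship by which 
for a given value of x1 the most probable value of x2 may be 
determined. Equation (11) describes the joint probability of any 
bivariate x1x2. The probability of any value of x2 occurring 
with some specified value of x1, say ẋ1, will be as follows: 

    dP(\dot{x}_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} e^{-\frac{1}{2(1 - r^2)}\left[\frac{\dot{x}_1^2}{\sigma_1^2} - \frac{2r \dot{x}_1 x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right]} dx_1 dx_2    (12) 

If (ẋ1/σ1)^2 is factored from the exponent of e, the equation 
becomes 

    dP(\dot{x}_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} e^{-\frac{\dot{x}_1^2}{2\sigma_1^2(1 - r^2)}} e^{-\frac{1}{2(1 - r^2)}\left[\frac{x_2^2}{\sigma_2^2} - \frac{2r \dot{x}_1 x_2}{\sigma_1\sigma_2}\right]} dx_1 dx_2    (13) 

The square of 

    \frac{x_2^2}{\sigma_2^2} - \frac{2r \dot{x}_1 x_2}{\sigma_1\sigma_2} 

is completed by adding 

    \frac{r^2 \dot{x}_1^2}{\sigma_1^2} 

which must also be subtracted to keep the value of the whole 
expression unchanged. This subtracted part may be conveniently 
put with the other (ẋ1)^2 term so that the final result of 
these operations is as follows: 

    dP(\dot{x}_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} e^{-\frac{\dot{x}_1^2 - r^2\dot{x}_1^2}{2\sigma_1^2(1 - r^2)}} e^{-\frac{1}{2(1 - r^2)}\left(\frac{x_2}{\sigma_2} - r\frac{\dot{x}_1}{\sigma_1}\right)^2} dx_1 dx_2    (14) 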

Upon simplifying the exponents and splitting up the constant 
term and the dx1 dx2, this expression becomes 

    dP(\dot{x}_1x_2) = \left[\frac{1}{\sigma_1\sqrt{2\pi}} e^{-\frac{\dot{x}_1^2}{2\sigma_1^2}} dx_1\right] \left[\frac{1}{\sigma_2\sqrt{1 - r^2}\sqrt{2\pi}} e^{-\frac{\left(x_2 - r\frac{\sigma_2}{\sigma_1}\dot{x}_1\right)^2}{2\sigma_2^2(1 - r^2)}} dx_2\right]    (15) 

Since the first factor is a constant (ẋ1 being given), Eq. (15) 
shows that the probability of an x2 for a given value of x1 is 
proportional to the probability of a normally distributed variate 
whose mean is r(σ2/σ1)ẋ1 and whose standard deviation is 
σ2√(1 - r^2). (It will be recalled that the general equation for 
the normal curve is 

    \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{x^2}{2\sigma^2}} dx.) 

Accordingly, the most probable value of x2 for specified values 
of x1, that is, the line of regression of x2 on x1, is as follows: 

    x_2 = r\frac{\sigma_2}{\sigma_1} x_1 

The standard deviation or scatter about this line is σ2√(1 - r^2). 

From Eq. (15) it is seen that the locus of all points representing 
the means of x2 for a given x1 is x2 = r(σ2/σ1)ẋ1, which is the 
equation of the line of regression of x2 on x1. The line of 
regression of x1 on x2 is given by interchanging x1 and x2 in the 
above argument. As indicated above, these two lines are the same 
as those that might be fitted to the distribution by the method of 
least squares. From Eq. (15) it is also shown that the standard 
deviation of x2 for a given x1 (in other words, the scatter at any 
point of the line of regression) is independent of the selected 
value of x1, for it is always equal to σ2√(1 - r^2). 
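
The two conclusions drawn from Eq. (15) can be checked numerically from the joint density alone. The following is a minimal Python sketch, not from the original; the function names, the grid settings, and the sample values of σ1, σ2, and r are illustrative assumptions. It holds x1 fixed, sums the joint density of Eq. (11) over a fine grid of x2 values, and compares the resulting mean and standard deviation with r(σ2/σ1)x1 and σ2√(1 - r^2). 

```python
import math


def joint_density(x1, x2, s1, s2, r):
    """Height of the correlated bivariate normal surface, as in Eq. (11)."""
    q = (x1 / s1) ** 2 - 2 * r * x1 * x2 / (s1 * s2) + (x2 / s2) ** 2
    k = 1.0 / (2 * math.pi * s1 * s2 * math.sqrt(1 - r * r))
    return k * math.exp(-q / (2 * (1 - r * r)))


def conditional_moments(x1, s1, s2, r, half_width=12.0, steps=4001):
    """Mean and standard deviation of x2 at fixed x1, by a Riemann sum."""
    lo = -half_width * s2
    dx = 2 * half_width * s2 / (steps - 1)
    xs = [lo + i * dx for i in range(steps)]
    ws = [joint_density(x1, x, s1, s2, r) for x in xs]
    total = sum(ws)
    mean = sum(w * x for w, x in zip(ws, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(ws, xs)) / total
    return mean, math.sqrt(var)


s1, s2, r = 2.0, 3.0, 0.5
for x1 in (-2.0, 0.0, 1.0):
    mean, sd = conditional_moments(x1, s1, s2, r)
    print(x1, round(mean, 4), round(sd, 4))
```

For each fixed x1 the computed mean falls on the line x2 = r(σ2/σ1)x1, while the computed scatter is the same at every x1. 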

NORMAL MULTIVARIATB FREQUENCY “SURFACE" 

When a bivariate distribution is described in geometrical 
terms, one of the dimensions can be used to measure the fre- 
quencies. This is not possible for distributions involving more 
than two variables. In the three-variable case, for example, all 
three dimensions must be used to indicate the variations in the 
variables themselves, and none is left to measure the frequencies. 

Resort is had in multivariate problems to the use of densities 
to measure frequencies. Such a device could have been used in 
the monovariate or bivariate case; instead of having the 
frequency of any interval represented by the height of a rectangle 
erected on the interval, it could be assumed that the cases were 
represented by points on a line, and the more points crowded into 
any given interval on the line, i.e., the greater the density of 
points in the interval, the greater would be the frequency of 
that interval. Likewise, in the bivariate case, instead of 
representing the frequency of cases in any given two-dimensional 
cell by the height of a rectangular pile of checkers set up on that 
cell, it would be possible to look upon the various cases as points 
in the two-dimensional plane; the frequency of points in any cell 
would then become the density of points in that cell. 

This is the device used to measure frequencies in the multi- 
variate case. For a trivariate distribution, for example, the 
various cases are looked upon as points in three-dimensional 
space, and the density of these points in any given three-dimen- 
sional cell becomes the measure of the relative frequency of cases 
in that cell. A trivariate frequency "surface," if it may be so 
called, is in reality a trivariate density function. The same idea 
may be carried over by analogy to distributions of four or more 
variables, although no graphical representation can actually be 
made of such distributions. 

The properties of a normal multivariate "surface" or density 
function are merely generalizations of the properties of a normal 
bivariate surface. Whereas in the latter case, loci of 
equiprobability (i.e., loci of constant level on the frequency 
surface) were ellipses in the x1x2-plane, in the multivariate case 
loci of equiprobability (i.e., loci of equal density in the 
N-dimensional space) are ellipsoids in the X1, X2, . . . , XN space. 
A picture of a three-dimensional ellipsoid is given in Fig. 132. 
This represents a contour of equiprobability for a trivariate normal 
distribution in which there is no correlation. Similar ellipsoids, 
some larger, some smaller, would represent other contours of 
equiprobability, and the whole distribution could be represented 
by a nest of such ellipsoids. The elliptical contours representing a 
high degree of probability are, of course, the contours close to 
the center, the center itself being the point of maximum 
probability (maximum density). As one goes off from the center 
in a straight line in any direction whatsoever, the change in 
probability (density) is in accordance with the normal law. If 
the variables are measured in standard-deviation units, the 
ellipsoids become spheres and the distribution becomes 
symmetrical in all directions. 

When there is correlation between the variables, the ellipsoids 
of equiprobability become tilted with respect to the various axes 
and flattened out. If the variables are measured in 
standard-deviation units, the degree of tilting in any direction is 
directly related to the amount of the correlation between the 
variables concerned. The greater the multiple correlation between 
the variables, the narrower or flatter the ellipsoids become. In 
the limit in which there is perfect correlation between all the 
variables, the whole distribution reduces to a line through 
the origin at an angle of approximately 54¾ degrees (cos⁻¹ 1/√3) 
with all the axes (assuming the variables are measured in σ units). 

Fig. 132. — Ellipsoid of equiprobability for a trivariate normal frequency 
distribution. 
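
The angle quoted for the limiting line is easily confirmed. The following is a minimal Python sketch; the function name is hypothetical, and the extension from three variables to a general number N (angle cos⁻¹ 1/√N for the line along which all standardized variables are equal) is an assumption added here for illustration, not a statement of the original. 

```python
import math


def angle_with_axes(n_variables):
    """Angle in degrees between the line x1 = x2 = ... = xN and each axis."""
    return math.degrees(math.acos(1.0 / math.sqrt(n_variables)))


print(round(angle_with_axes(3), 2))  # trivariate case discussed in the text
```

For three variables the result is about 54.74 degrees, in agreement with the approximate figure of 54¾ degrees. 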

As in the simpler case, a plane or hyperplane of regression 
represents the locus of the mean values of one variable for various 
combinations of the other variables. For a normal multivariate 
distribution, the deviations from any plane of regression are all 
normally distributed with a constant standard deviation for any 
one plane. 

All the properties of a normal bivariate distribution thus carry 
over to a normal multivariate distribution, the only difference 
being that ellipses of equiprobability and lines of regression now 
become ellipsoids and hyperplanes of higher dimensions. 
Basically, the character of the distribution is the same. 

NONNORMAL BIVARIATES AND MULTIVARIATES 

If a bivariate or multivariate distribution does not approach 
the normal form, much of the conventional correlation analysis 
loses its significance. In some cases, by taking logarithms or 
reciprocals a nonnormal distribution may be transformed into a 
normal distribution.¹ In some instances, a multivariate 
distribution may be normal with respect to its variations about the 
means of the rows and columns but the means of the rows or 
means of columns may trace out a curve of regression. In other 
instances, the regressions of the means may be linear, or planar, 
but the deviations around these lines, or planes, of regression 
may be either nonnormally distributed or normally distributed 
with varying standard deviations. 

If, in the case of two variables, the regressions are linear, the 
initial arguments presented for the use of the product-moment 
formula for r are still valid even for nonnormal distributions.² 
Large values of Xi would still tend in general to be associated 
with large values of X2 (or with small values if the correlation is 
negative), and a formula based upon the product deviations 
would give a good measure of the association between the two 
variables. If the distribution of cases around the lines of regres- 
sion is skewed, however, or if the standard deviation varies from 
one part of the line to another, the scatter about the lines of 
regression loses its significance as a measure of typical variability. 
Great care must be taken in these cases in using an average 
scatter to determine the degree of error in an estimate based on 
the line of regression. When the distributions are not normal, 
the rule that two-thirds of the cases tend to lie between plus and 
minus <ruj no longer holds. 

Finally, if the bivariate distribution is not normal, even the 
product-moment formula may cease to be a statistic of special 
significance in characterizing the distribution. In the normal 
case, if the two means, the two standard deviations, and r are 
all known, the bivariate distribution is fully determined. In the 
nonnormal case, other measures similar to measures of skewness 
and kurtosis in the monovariate case may be of equal if not 
greater importance in describing the bivariate distribution. 
These considerations should always be borne in mind when r is 
used to measure correlation between nonnormally distributed 
bivariates. 

¹ See Chap. XV, pp. 377-396. 

² See Chap. XIII, pp. 338-363. 

Similar statements may also be made about nonnormal 
multivariate distributions. Here the higher dimensionality 
multiplies the possibilities of skewness, kurtosis, and other 
departures from normality.¹ 

APPENDIX 

DERIVATION OF THE EQUATION FOR THE NORMAL BIVARIATE 
FREQUENCY SURFACE, CORRELATED VARIABLES 

The normal bivariate surface in which x1 and x2 are correlated is 
elliptical in form but rotated and narrower than the elliptical surface 
representing uncorrelated bivariates. The distribution of the probabilities or 
relative frequencies is given by an expression of the following form: 

    dP(x_1x_2) = k e^{-\frac{1}{2}(a'x_1^2 + 2h'x_1x_2 + b'x_2^2)} dx_1 dx_2    (16) 

in which the constants k, a', h', and b' may be evaluated in terms of the 
moments of x1 and x2. 

First it is to be noted that 

    \iint P(x_1x_2) dx_1 dx_2 = 1                          (i) 
    \iint P(x_1x_2) x_1 dx_1 dx_2 = 0                      (ii) 
    \iint P(x_1x_2) x_2 dx_1 dx_2 = 0                      (iii) 
    \iint P(x_1x_2) x_1^2 dx_1 dx_2 = \sigma_1^2           (iv) 
    \iint P(x_1x_2) x_2^2 dx_1 dx_2 = \sigma_2^2           (v) 
    \iint P(x_1x_2) x_1x_2 dx_1 dx_2 = r\sigma_1\sigma_2   (vi) 

Equation (i) is true since the sum of all probabilities or relative frequencies 
is necessarily one. Equations (ii) and (iii) are true because x1 and x2 
represent deviations from the means of X1 and X2. Thus 
\iint P(x_1x_2) x_1 dx_1 dx_2 is equivalent to \sum x_1 / N, which equals 
zero. Likewise, 

    \iint P(x_1x_2) x_2 dx_1 dx_2 = \frac{\sum x_2}{N} = 0 

Equation (iv) is another form of \sum x_1^2 / N, which is equal to the 
variance of X1; Eq. (v) is equivalent to \sum x_2^2 / N, which is equal to the 
variance of X2; and Eq. (vi) is equivalent to \sum x_1x_2 / N, which is equal 
to rσ1σ2, since r = \sum x_1x_2 / (N\sigma_1\sigma_2). 

¹ For a more complete consideration of the problem of nonnormality, 
see Smith and Duncan, Sampling Statistics, Chap. 18. 

Second it is to be noted that, with reference to its rotated axes, aa' and 
bb', the equation of the ellipse representing the intersection of the 
frequency surface by a horizontal plane at a distance ke^{-C/2} from the base 
plane is as follows: 

    A x_1'^2 + B x_2'^2 = C 

where x1' and x2' represent the coordinates of a point with reference to the 
axes aa' and bb', that is to say, x1' measures the perpendicular distance of a 
point from aa' and x2' measures the perpendicular distance of a point from 
bb'. If the areal element¹ dx1 dx2 is also expressed in terms of the x1'x2' 
coordinates, it becomes dx1 dx2 = dx1' dx2'.* The whole probability function 
thus becomes 

    dP(x_1'x_2') = k e^{-\frac{1}{2}(A x_1'^2 + B x_2'^2)} dx_1' dx_2'    (17) 

But this is the form of a normal frequency surface for uncorrelated variables 
(see pages 474-475); since there is no cross-product term, H = 0. 


¹ It will be recalled that dP(x1x2) = F(x1x2) dx1 dx2 is represented 
geometrically by the volume under the surface F(x1x2) cut off by a hollow pipe, 
erected on a rectangle in the x1x2 plane, the sides of which are dx1 and dx2 
(see Fig. 125, p. 475). To express the whole probability distribution in 
terms of the new x1'x2' coordinates, the area of the pipe's base, dx1 dx2, must 
be transformed into these new coordinates as well as the height of the pipe, 
F(x1x2). 

* The transformation of coordinates is of the form 

    x_1 = x_2' \sin\alpha + x_1' \cos\alpha 
    x_2 = x_2' \cos\alpha - x_1' \sin\alpha 

where α is the angle that aa' makes with the x2-axis. Cf. Fine and 
Thompson, Coordinate Geometry, p. 120. Since, in general, dx1 dx2 equals, 
within differentials of higher order, 

    \begin{vmatrix} \partial x_1/\partial x_1' & \partial x_1/\partial x_2' \\ \partial x_2/\partial x_1' & \partial x_2/\partial x_2' \end{vmatrix} dx_1' dx_2' 

it follows that 

    dx_1 dx_2 = \begin{vmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{vmatrix} dx_1' dx_2' = dx_1' dx_2' 

since cos^2 α + sin^2 α = 1. Cf. Wilson, E. B., Advanced Calculus, pp. 
133-134. 

The distribution function, Eq. (17), may therefore be written as follows: 

    dP(x_1'x_2') = \frac{1}{2\pi\sigma_1'\sigma_2'} e^{-\frac{1}{2}\left(\frac{x_1'^2}{(\sigma_1')^2} + \frac{x_2'^2}{(\sigma_2')^2}\right)} dx_1' dx_2' 

where σ1' and σ2' are the standard deviations of the new variables x1' and x2'. 
It will be noted that this transformation has not changed the probability 
of a given x1x2 combination but has merely expressed it in terms of a new 
set of coordinates. Accordingly, P(x1x2) = P(x1'x2'), where x1' and x2' are 
derived by a linear transformation from x1 and x2.* 

Finally, it will be noted that in any equation of the second degree the 
product of the coefficients of the squared terms minus the square of 
one-half the coefficient of the cross-product term is invariant (that is, its 
value remains unchanged) under simple translations and rotations.¹ 
Accordingly, the following relationship holds: 

    AB - H^2 = a'b' - h'^2    (18) 

or, since H = 0, 

    AB = a'b' - h'^2 

But inasmuch as A = (1/σ1')^2 and B = (1/σ2')^2, it follows that 

    AB = \frac{1}{(\sigma_1')^2(\sigma_2')^2} = (2\pi k)^2 = a'b' - h'^2 

From this it follows that 

    k = \frac{\sqrt{a'b' - h'^2}}{2\pi} 

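Use will now be made of these relationships to derive the values of a', b', 
and h'. 

As noted above, since 

    \iint k e^{-\frac{1}{2}(a'x_1^2 + 2h'x_1x_2 + b'x_2^2)} dx_1 dx_2 = 1 

then 

    \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\frac{1}{2}(a'x_1^2 + 2h'x_1x_2 + b'x_2^2)} dx_1 dx_2 = \frac{1}{k} = \frac{2\pi}{\sqrt{a'b' - (h')^2}}    (19) 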
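
Equation (19) can be checked by brute-force summation over a grid. The following is a minimal Python sketch, not from the original; the function name, the grid limits, and the sample coefficients are illustrative assumptions, and the quadratic form must be positive definite (a'b' - h'^2 > 0 with a' > 0) for the integral to exist. 

```python
import math


def gaussian_integral(a, h, b, half_width=12.0, steps=401):
    """Riemann-sum estimate of the double integral of exp(-Q/2),
    where Q = a*x1^2 + 2h*x1*x2 + b*x2^2, over a large square."""
    dx = 2 * half_width / (steps - 1)
    grid = [-half_width + i * dx for i in range(steps)]
    total = 0.0
    for x1 in grid:
        for x2 in grid:
            q = a * x1 * x1 + 2 * h * x1 * x2 + b * x2 * x2
            total += math.exp(-0.5 * q)
    return total * dx * dx


a, h, b = 1.0, 0.5, 1.0  # illustrative positive-definite coefficients
numeric = gaussian_integral(a, h, b)
closed_form = 2 * math.pi / math.sqrt(a * b - h * h)
print(round(numeric, 6), round(closed_form, 6))
```

The two printed values should agree to the displayed precision, which is the content of Eq. (19) for this choice of coefficients. 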


* See footnote *, p. 493. 

¹ Cf. Fine and Thompson, op. cit., p. 131. 

If both sides of Eq. (19) are differentiated with respect to a', it is found 
that 

    -\frac{1}{2}\iint x_1^2 e^{-\frac{1}{2}(a'x_1^2 + 2h'x_1x_2 + b'x_2^2)} dx_1 dx_2 = -\frac{1}{2}\,\frac{2\pi b'}{[a'b' - (h')^2]^{3/2}} = -\frac{1}{2}\,\frac{b'}{k[a'b' - (h')^2]} 

By canceling out -1/2 and multiplying the equation by k, the left side is 
equal to σ1^2 [see Eq. (iv), page 492], and it is found that 

    \sigma_1^2 = \frac{b'}{a'b' - (h')^2}   or   b' = \sigma_1^2[a'b' - (h')^2]    (a) 

If similar procedure is followed after differentiating Eq. (19) with respect 
to b', it is found that 

    \sigma_2^2 = \frac{a'}{a'b' - (h')^2}   or   a' = \sigma_2^2[a'b' - (h')^2]    (b) 

If both sides of Eq. (19) are differentiated with respect to h', it is found that 

    -\iint x_1x_2 e^{-\frac{1}{2}(a'x_1^2 + 2h'x_1x_2 + b'x_2^2)} dx_1 dx_2 = -\frac{1}{2}\,\frac{2\pi(-2h')}{[a'b' - (h')^2]^{3/2}} = \frac{h'}{k[a'b' - (h')^2]} 

in which, if multiplied through by k, the left side equals -σ1σ2r12 [see Eq. 
(vi)], and hence the whole expression reduces to 

    -\sigma_1\sigma_2 r_{12} = \frac{h'}{a'b' - (h')^2}   or   h' = -\sigma_1\sigma_2 r_{12}[a'b' - (h')^2]    (c) 

From Eqs. (a), (b), and (c), it follows that 

    b' = \frac{\sigma_1^2}{\sigma_2^2} a',   h' = -r\frac{\sigma_1}{\sigma_2} a',   \sqrt{a'b' - (h')^2} = \frac{\sigma_1}{\sigma_2} a' \sqrt{1 - r^2}    (d) 

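Equations (a), (b), and (c) are three equations from which the values of 
a', h', and b' may be expressed in terms of σ1, σ2, and r. The direct 
evaluation of a', b', and h' from these equations is not a simple matter, 
however, and it is easier to proceed as follows: From Eqs. (a), (b), and (c), 
it is possible to express b' and h' in terms of a', as noted above in Eq. (d). 
It will also be recalled that 

    k = \frac{\sqrt{a'b' - (h')^2}}{2\pi} = \frac{a'\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2} 

By substituting equivalent values, Eq. (16) may accordingly be written 
as follows: 

    dP(x_1x_2) = \frac{a'\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2} e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2r x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)} dx_1 dx_2    (20) 

The double sum \iint dP(x_1x_2) = 1, however, so that, from Eq. (20), it 
follows that 

    \iint e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2r x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)} dx_1 dx_2 = \frac{2\pi\sigma_2}{\sigma_1\sqrt{1 - r^2}} \cdot \frac{1}{a'}    (21) 

If both sides of Eq. (21) are differentiated with respect to a', it is found 
that 

    \iint e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2r x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)} \left[-\frac{\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2r x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)\right] dx_1 dx_2 = -\frac{2\pi\sigma_2}{\sigma_1\sqrt{1 - r^2}} \cdot \frac{1}{(a')^2} 

Multiplication of both sides by -a' and expansion of terms then give the 
following: 

    a'\left[\frac{1}{2}\iint x_1^2 e^{(\cdot)} dx_1 dx_2 - r\frac{\sigma_1}{\sigma_2}\iint x_1x_2 e^{(\cdot)} dx_1 dx_2 + \frac{\sigma_1^2}{2\sigma_2^2}\iint x_2^2 e^{(\cdot)} dx_1 dx_2\right] = \frac{2\pi\sigma_2}{\sigma_1\sqrt{1 - r^2}} \cdot \frac{1}{a'} 

where e^{(\cdot)} stands for the exponential factor of Eq. (21). But, by Eq. 
(20), the left side is equal to 

    \frac{2\pi\sigma_2}{\sigma_1\sqrt{1 - r^2}}\left[\frac{1}{2}\iint x_1^2 dP(x_1x_2) - r\frac{\sigma_1}{\sigma_2}\iint x_1x_2 dP(x_1x_2) + \frac{\sigma_1^2}{2\sigma_2^2}\iint x_2^2 dP(x_1x_2)\right] 

in which the bracketed part, according to Eqs. (iv) to (vi), is equal to 

    \frac{1}{2}(\sigma_1^2) - r\frac{\sigma_1}{\sigma_2}(r\sigma_1\sigma_2) + \frac{\sigma_1^2}{2\sigma_2^2}(\sigma_2^2) = \sigma_1^2(1 - r^2) 

Hence, 

    \sigma_1^2(1 - r^2) = \frac{1}{a'} 

or 

    a' = \frac{1}{\sigma_1^2(1 - r^2)} 

If this value of a' is substituted in Eq. (20), it will give an equation in 
which all the parameters have been evaluated in terms of the moments as 
follows: 

    dP(x_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} e^{-\frac{1}{2(1 - r^2)}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2r x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)} dx_1 dx_2    (22) 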
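
The parameter values reached in this Appendix can be checked against relations (a), (b), and (c) for any admissible σ1, σ2, and r. The following is a minimal Python sketch; the function name and the sample values are illustrative and not part of the original derivation. 

```python
import math


def quadratic_form_parameters(s1, s2, r):
    """k, a', h', b' of Eq. (16) in terms of sigma1, sigma2, and r."""
    a = 1.0 / (s1 ** 2 * (1 - r * r))       # a' as found from Eq. (21)
    b = (s1 / s2) ** 2 * a                  # b' from Eq. (d)
    h = -r * (s1 / s2) * a                  # h' from Eq. (d)
    k = math.sqrt(a * b - h * h) / (2 * math.pi)
    return k, a, h, b


s1, s2, r = 2.0, 3.0, 0.5
k, a, h, b = quadratic_form_parameters(s1, s2, r)
det = a * b - h * h

# Relations (a), (b), and (c) of the Appendix.
print(abs(b - s1 ** 2 * det) < 1e-12)
print(abs(a - s2 ** 2 * det) < 1e-12)
print(abs(h + s1 * s2 * r * det) < 1e-12)
# The constant term of Eq. (22).
print(abs(k - 1 / (2 * math.pi * s1 * s2 * math.sqrt(1 - r * r))) < 1e-12)
```

Each comparison should print True, indicating that the substituted values satisfy relations (a) to (c) and reproduce the constant term of Eq. (22). 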



PART V 


Study of Dynamic Variability 

CHAPTER XIX 

INDEX NUMBERS 

One of the most widely used statistical methods is the proce- 
dure that gives rise to the summarizing or expression of data in 
the form of index numbers. It is an application to a practical 
problem of simple principles of ready comparison, principles of 
averaging to obtain summary figures, and principles of stratified 
sampling. Today, the method of index numbers is applied in 
five large fields, as follows: 

1. The measurement of the general price level, or the measure- 
ment of general exchange value. 

2. The measurement of groups of prices, such as wholesale 
prices, retail prices, or wages. 

3. The measurement of the general quantity of production or 
trade with indexes of physical production, trade, or employment. 

4. The measurement of the general volume of business or 
trade with indexes of the value of production or trade, or with 
so-called "barometers." 

5. Miscellaneous, including a wide variety of uses, some of 
which are given below on pages 511-513. 

History of Index Numbers. General use of the device known 
as an "index number" to serve as a comprehensive method of 
summarization is of recent origin. Like most of the modern 
technique of statistics it has been developed since 1900. But 
the fundamental idea is an old one. According to Warren and 
Pearson, as early as 1738 Dutot made price comparisons showing 
that a group of representative commodities cost twelve times as 
much in 1735 as they did in 1508.¹ In 1764, an Italian, G. R. 
Carli, attempted an investigation into the effect of the discovery 
of America upon the purchasing power of money; he constructed 
a very simple index number of prices, using only three 
commodities, grain, wine, and oil. He combined the prices of these 
three commodities in order to compare their average level in 
1750 with the level of the same commodities in 1600.* The 
gold movement from the New World to European countries 
aroused speculation throughout the mercantile period with 
respect to the relationship between prices and the amount of 
money in circulation. Locke and Hume laid the groundwork 
for the statement of what is now known as the quantity theory of 
money. Speculations of the seventeenth and eighteenth 
centuries, however, with the exception of Carli's unusual attempt, 
were without the assistance of any measurement of the general 
price level. 

¹ Warren, George F., and Frank A. Pearson, Prices (1933), pp. 18-20, 
containing other interesting examples of attempts to measure changes in 
general price level prior to the middle of the nineteenth century. 

Concern about the problem was brought to a new height during 
the Napoleonic Wars, when prices were fluctuating widely; and 
again during and following the Greenback era in the United 
States the question of the relationship between the general price 
level and the money supply became associated with inflationary 
issue of paper money. In the decade preceding the Civil War 
the discoveries of gold in California served to arouse interest in 
the question of the effect of increased supplies of gold upon the 
general price level. 

Twentieth-century economists, already interested in the 
quantity theory of money by reason of the accumulation of these 
historical experiences, were provoked to continued and diligent 
study by the development of the South African gold mines since 
the 1890’s, accompanying world-wide general rising prices until 
the First World War. During the First World War and the 
subsequent period of maladjustment, with countries all over the 
world alternately on and off the gold standard, speculation in 
monetary theory became of such general interest that the prob- 
lem preoccupied some economists almost to the exclusion of other 
fields of study. 

Meanwhile the statistical technique of measuring general price 
changes by the index-number device was developed; by 1798, 

* Mitchell, Wesley C., "Index Numbers of Wholesale Prices in the 
United States and Foreign Countries," Bureau of Labor Statistics, Bulletin 
284, p. 7; also reprint of Part I, "The Making and Using of Index Numbers," 
Bulletin 656 (1938), p. 7. 

Sir George Shuckburgh-Evelyn formulated a plan for making 
index numbers of prices.¹ The efforts of the early statisticians in 
this direction were accorded but scant approval by the economists, 
who were apparently suspicious of "political arithmetic." 
Ricardo said that it is impossible to determine "the 
value of a currency" by its "relation not to one, but to the mass 
of commodities."² Early in the nineteenth century mathematicians 
were more interested in the application of the theory 
of probabilities in the fields of astronomy, biology, anthropology, 
and geology. The great exponents of the developing technique 
in the application of statistical theory to the social sciences, such 
as Quételet, were busy with problems in the realm of ethics and 
morals; but about the middle of the nineteenth century came 
powerful support for the application of these principles to economic 
statistics. 

William Stanley Jevons claimed that the works of Quételet 
abundantly proved that many subjects in the social sciences are 
so hopelessly intricate that they can be analyzed only by the 
use of averages and by trusting to probabilities as the form of 
generalization. He constructed indexes of wholesale prices in 
order to measure the value of gold and invoked the theory of 
probabilities as justification of his claim that the rise in prices 
was connected with the change in the value of gold, saying that 
"the odds are 10,000 to 1 against a series of disconnected and 
casual circumstances having caused the rise of prices — one in the 
case of one commodity, another in the case of another — instead 
of some general cause acting over them all.” The general cause 
acting over them all was considered to be the change in the 
value of gold.³ 

In 1887 Prof. F. Y. Edgeworth began a series of contributions 
to the problem of index numbers as a method of summarizing 
trends in price statistics. He brought to bear upon the field of 
the social sciences the mathematical theory of probability. He 
saw clearly that it is a problem of applying a strictly a priori 

¹ "An Account of Some Endeavors to Ascertain a Standard of Weight 
and Measure," Philosophical Transactions of the Royal Society of London, 
Part I, Art. viii, pp. 132-185; citation from Wesley C. Mitchell, Business 
Cycles: The Problem and Its Setting (1928), p. 191. 

² Ibid., p. 193. 

³ Ibid., p. 195. 

theory to an analogous situation, but he insisted that the theory 
of probability applied.¹ Later, the theoretical application of 
probabilities to the problem of measuring social phenomena, and 
particularly the general price level, was taken up by C. M. 
Walsh, who published in 1901 a treatise on the measurement of 
the price level and later published a book entitled The Problem 
of Estimation, which further developed the application of probability 
theory to economics.² 

Since about 1915 the important problem of the technique of 
index-number construction has been attacked by a number of 
scholars. Prof. Wesley C. Mitchell was a pioneer in the explora- 
tion of the technical problems involved and a major part of their 
solution; others have done important work of this character 
during recent years, especially the economists and statisticians 
in government or semigovernment agencies, such as the Bureau 
of Labor Statistics and the Federal Reserve Board. 

Interpretation of the problems involved in the making of 
index numbers may be facilitated by analysis of two of the main 
principles involved: (1) the concepts of absolutes vs. relatives, 
and (2) the application of the theory of stratified sampling to the 
particular problem of the making of an index number. 

Conversion of Absolute Numbers to Relative Numbers. 
Absolutes. An absolute is an expression of the number of things 
being considered, measured by an appropriate unit, as 1,000 
bushels of wheat or 50 acres of land. A simple absolute taken 
by itself is of little importance. The number of people in a 
country is of no particular significance unless a comparison is 
desired, for example, a comparison with the natural resources 
of the country or with the population at some other point in time 
or in some other country. 

Prices are ordinarily conceived of as absolute; that is, saying 
that the price of wheat today is one dollar a bushel refers to the 
objective thing, namely, the concrete one dollar. It is true that 
this particular absolute has a ratio aspect when it is thought of 

¹ Persons, W. M., "Statistics and Economic Theory," Review of Economic 
Statistics, Vol. 7 (1925), pp. 185-186. Also cited in Wesley C. Mitchell, 
Business Cycles: The Problem and Its Setting (1928), p. 197. 

² The Measurement of General Exchange-Value, pp. 558-574, cited in 
Wesley C. Mitchell, "Index Numbers of Wholesale Prices in the United 
States and Foreign Countries," Bureau of Labor Statistics, p. 9. 

as a measure of the value of wheat. But when thought of merely 
as one of the goods in an exchange, the dollar can rationally be 
considered to be an absolute. Prices accordingly are referred 
to as "absolutes." 

Relatives. In tabular form a ready visualization is often 
accomplished by converting absolutes to relatives of some selected 
base. For example, Table 63 shows data on three important 
types of productive activity in the United States. 


Table 63. Estimated Value of Selected Types of Private Construction 
Activity in the United States 

                     New factory            Farm                   New nonfarm residential
                     construction           construction           construction
                     Millions               Millions               Millions
Years                of dollars   Index*    of dollars   Index*    of dollars   Index*

Annual average
  1926-1929               640      100           468      100         4,066      100
1932                       78       12           125       27           641       16
1933                      128       20           175       37           314        8
1937                      391       61           360       77         1,530       38
1938                      192       30           345       74         1,515       37
1939                      200       31           340       73         1,860       46
1940                      337       53           360       77         2,077       51

Source: Survey of Current Business, Vol. 21 (February, 1941), p. 21. 
* Each index is on the base, average 1926-1929 = 100. 


Considerable difficulty is encountered in obtaining a clear 
mental picture of the comparative changes in these three series 
by study of the absolutes themselves. Was the decline in new 
factory construction more severe in the 1932-1933 depression 
than the falling off in new residential construction? Did farm 
construction suffer more severely than new residential nonfarm 
construction? Such questions, involving comparative judg- 
ments, can be answered much more quickly if each series is 
converted into relatives or simple indexes upon a common base 
period. This is illustrated in Table 63, in the columns presenting 
the indexes with average 1926-1929 as the base. 

Simple index numbers, or relatives, of this sort involve the 
notion of comparing with unity. The mind more readily 
grasps expressions in round numbers than in odd numbers; it 
further reduces mental effort if the round numbers are in multiples 
of 10. From this fact arises the practice of relating prices 
or other quantity figures or absolutes of any kind to each other 
in such a way as to get a comparison based upon 1, 10, 100, 
1,000, etc. If based upon 1, they are called "proportions"; if 
based upon 100, they are called "percentages." They are all 
relatives, or indexes. Most commonly in the United States and 
in Great Britain and many other countries, 100 is used, although 
a few, notably Australia, use 1,000. 

Even where there is but one price series, it is simpler to com- 
prehend the significance of change if the absolute prices are 
converted to a relative form. For example, the changes in 
price of coffee per pound, as shown in Table 64, are easier to 
trace from period to period when expressed in relatives. Thus, 


Table 64. Price of Coffee 
Annual averages in New York market of No. 7 Rio coffee 
(In dollars per pound) 

Item          1926     1933     1934     1941
Price, lb     0.182    0.078    0.098
Relative      100      43       54       44

Source: Bureau of Labor Statistics, Wholesale Prices (June and December issues of 
specified years). 


let 1926 be considered 100 and the prices in other years related 
to it. The arithmetic involved is simple in principle and contains 
two steps: (1) dividing the series throughout by the base selected, 
which may more conveniently be done by multiplying throughout 
by the reciprocal of the base and (2) multiplying by 100. This 
method, illustrated in Table 64, makes the figure for the base 
period equal to 100, and the rest fluctuate as percentages of the 
base. 
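
The two steps just described may be sketched in a few lines of Python. This is only an illustrative sketch: the function name price_relatives and its parameters are assumptions made for the example, and the figures are the 1926, 1933, and 1934 coffee prices of Table 64.

```python
def price_relatives(prices, base_key, scale=100):
    """Convert absolute prices to relatives on the chosen base period."""
    # Step 1: take the reciprocal of the base-period price.
    reciprocal = 1.0 / prices[base_key]
    # Step 2: multiply every price by the reciprocal and by the scale (100).
    return {period: round(price * reciprocal * scale)
            for period, price in prices.items()}


# Coffee prices in dollars per pound, from Table 64.
coffee = {1926: 0.182, 1933: 0.078, 1934: 0.098}
coffee_relatives = price_relatives(coffee, base_key=1926)
print(coffee_relatives)  # {1926: 100, 1933: 43, 1934: 54}
```

Passing scale=1 or scale=1000 in this sketch would give proportions or relatives on a base of 1,000 instead of percentages.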

Another elementary idea is involved in the making of relatives, 
and that is the reduction of nonhomogeneous sets of figures to a 
homogeneous base for purposes of comparison and to simplify 
interpretation of relative change among nonhomogeneous things. 
For example, the prices of coffee per pound at different times, 
the prices of canned peaches per dozen cans, and the prices of 
wheat per bushel are all three presented in Table 65 for 
comparison with each other. 













It is difficult to compare the price of coffee per pound with the 
price of wheat per bushel on the one hand and with the price of 
canned peaches per dozen cans on the other, as they fluctuate 
from time to time; but if all are changed to relative numbers, 
by the method already described, with 1926 as a base period, 
the comparison may easily be made. This is illustrated in 
Table 66. 


Table 65. Prices of Coffee, Canned Peaches, and Wheat¹ 

Item               1926     1933     1934     1941
Coffee             0.182    0.078    0.098
Canned peaches     1.993    1.146    1.403    1.528
Wheat              1.496                      0.993

Source: Bureau of Labor Statistics, Wholesale Prices. 

¹ Prices of canned peaches are annual averages, quoted in dollars per dozen cans; prices 
of wheat are of No. 2 hard, Kansas City, quoted in dollars per bushel. 


Relatives Using a Base Period in a Time Series. Price relatives, 
and the relatives shown in Tables 65 and 66, illustrate relatives 
using a base period in a time series. Three fundamental pre- 
cautions must be observed in the use of such relatives. 


Table 66. Price Relatives of Coffee, Canned Peaches, and Wheat 
Average 1926 = 100 

[Table body illegible in the source.] 


1. It is almost always advisable and sometimes it is necessary 
to know the absolute figures as well as the relatives — else mis- 
interpretation or even misrepresentation is likely to result. 
A classic example of a use of relatives that produced misinterpre- 
tation and may perhaps have even been intended to be misrepre- 
sentation was the evidence presented in 1932 by some notable 
protectionist "statesmen" in the United States Congress. 
Following are some of the statistics they issued for public 
consumption: 

Table 67. Some of the Large Increases in Imports during the First 
8 Months of 1932 
(In percentages) 

                                                            Percentage
Commodity                                                   increase in imports

Cod and other salt and pickled fish from Denmark                3,729.8
Salmon, fresh or frozen, from Japan                             2,511.8
Fish in airtight containers from Canada                         4,669.9
Cheese from Denmark                                               136.3
Wrapping paper (other than kraft) from Sweden                     615.9
Pig iron from Sweden                                              181.0
Pig iron from the United Kingdom                                  611.3
Wool and other yarns from the United Kingdom                      221.2
Long-staple cotton from Egypt and British India,
  but transshipped from the United Kingdom:
    Egypt                                                       1,283.1
    British India                                               1,128.1
From Canada, fresh pork                                           237.9
Dried peas from New Zealand                                       477.3

The purpose of these statistics was to prove that a veritable 
flood of foreign goods was threatening to inundate this country, 
put out of business all its domestic producers, and lower the 
wages of domestic workers. But the statistics are not what they 
seem to be. Some of the items were so very small in the aggre- 
gate in January, 1932 (the date they began to increase according 
to the table), that they were not even listed in the extensive 
classified list of imports that is published monthly by the Depart- 
ment of Commerce. If an exceedingly small item is increased by 
1,000 per cent, it is still small. Each time it increases 1,000 per 
cent, it is only eleven times as large as before; 2,000 per cent 
means twenty-one times as large. In January, 1932, the amount 
of pig iron imported into the United States from Sweden and the 
United Kingdom combined was less than 460 tons, worth about 
$4,500, which, compared with United States domestic produc- 
tion, was a mere nothing. The imports in January, 1932, of 
wrapping paper other than kraft from Sweden amounted to 
$2,025. The imports in the same month of Egyptian cotton, 
transshipped from the United Kingdom, amounted to $982. 
The last item is particularly enlightening; it will be noted that 
the specification "transshipped from the United Kingdom" is 
carefully made. Most cotton of this type comes to the United 
States directly from Egypt and is not essentially competitive 
with American-grown cotton. 
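
The arithmetic connecting a percentage increase with the multiple it implies can be checked with a short Python sketch. The function name is hypothetical, and the $1,000 starting value is an assumed figure chosen only for illustration; it is not taken from the import statistics.

```python
def multiple_from_percent_increase(pct_increase):
    """Return how many times as large a quantity is after a percentage increase."""
    return 1 + pct_increase / 100


# A 1,000 per cent increase makes an item eleven times as large;
# a 2,000 per cent increase makes it twenty-one times as large.
print(multiple_from_percent_increase(1000))  # 11.0
print(multiple_from_percent_increase(2000))  # 21.0

# Hypothetical item: imports worth $1,000 that rise by 1,000 per cent.
start_value = 1_000
end_value = start_value * multiple_from_percent_increase(1000)
print(end_value)  # 11000.0, still a small amount in absolute terms
```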
2. The meaning of a percentage figure is often ambiguous, and 
study of its background is necessary before it can be properly 
understood. An illustration of the misinterpretation of a per- 
centage figure can again be found in the arguments of American 
protectionists. When it is alleged that our tariffs are already 
too high, the protectionists like to reply that they are not too 
high. To prove their statement they point to the fact that a 
large percentage of the imports are on the "free list," that is, 
that a large proportion of imports into the United States are 
charged no duty at all. This argument sounds plausible, but 
its non sequitur quality becomes evident when it is realized that 
the tariffs on dutiable goods are so high that they are virtually 
excluded from entering; it is thus the virtual exclusion of certain 
dutiable imports that causes a large proportion of imports to 
be goods on the free list. If the entire 100 per cent of imports 
were on the free list, it would mean, not that the tariff was not 
high, but that the tariff was so high that none of the dutiable 
goods could come in. 

3. In a series of coordinate relatives, it is necessary to know 
the base and to specify it for the information of others. For 
example, death-rate figures are quite meaningless unless the 
comparison is known. The death rate may be expressed as so 
many deaths per 1,000 people or per 100 people, and the statisti- 
cian should indicate which comparison is made. Death rates 
for a given disease may be expressed as the number of deaths 
per 1,000 people afflicted by the disease rather than as the number 
of deaths per 1,000 people whether exposed to it or not; again, the 
nature of the comparison should be specified. 

In simple index numbers like those given as examples in 
Tables 64 and 66, it is essential to know that the base is 1926 
or the average of 1926-1929, as the case may be. This should 
always be indicated somewhere in the title or subcaptions of the 
table or in a footnote. 

Presumption of Normality in the Selected Base. When a series 
of coordinate relatives is constructed by relating a series of 
absolutes to some selected base, the base tends to be regarded 
as the normal level. Indexes in the series greater than 100 are 
looked upon as above normal, or above par, and indexes less 
than 100 are looked upon as below normal. Since this tendency 
exists, it is always desirable to give study to the matter of 
selecting the base. Is it in fact the one that is at normal level, 
or is one of the other absolutes of the series at normal level? 

For example, in the illustration given in Table 63, should the 
annual average of new factory construction, 1926-1929, be 
regarded as normal? Was there, on the average, a normal 
amount of new nonfarm residential construction in those years? 
It might reasonably be argued that taking 3 years as a base is 
better than taking only 1 year, because an average of 3 years 
might tend to offset extreme fluctuations and produce comparisons 
that would tend to be better than if only 1 year were used 
as a base. Thus the average of the 3 years might be about 
normal for each of the three types of construction compared, 
whereas if only 1 year were taken one or the other of the three 
types might have had an exceptionally high or low year. 

On the other hand, it may be pointed out that the years 
1926-1929 covered a range of years in which a great construction 
boom reached its peak. Consequently, all types of construction 
were above normal in all three of those years; some writers claim 
this was the peak of the greatest and longest construction boom 
in history. Construction was at a high level such as it might 
not be expected to reach again for many years, at least if the 
length of construction booms is some seventeen years from peak 
to peak, as some say it is. It may therefore be argued that 
1937 would be a better base to take, even if only 1 year is used. 
In that year the general level of economic activity seemed to 
be nearer to a normal or equilibrium than any other year in 
recent history, and certainly nearer normal than the boom year 
of 1929. But the year 1937 would be a poor base year for strike 
statistics because of the great disturbances in the coal industry 
in that year. 

Selection of the base has an important effect upon subsequent 
judgments as to the trends of the three series. If the average 
1926-1929 is taken as the base, all three of the construction 
activities were still below normal in the year 1940, as the indexes 
in Table 63 show; but if 1937 is taken as the base, the 1940 level 
of new factory construction would be 86, the 1940 level of farm 
construction would be 100, and the level of new nonfarm residen- 
tial construction would be 136. If 1937 is considered normal, 
in the years 1926-1929 new factory construction averaged 63 
per cent above normal, farm construction averaged 30 per cent 
above normal, and new nonfarm residential construction was 
166 per cent above normal. 
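
The effect of shifting the base can be verified directly from the absolutes in Table 63. The following Python sketch, in which the dictionary layout and the function name index_on_base are assumptions made for illustration, computes the 1940 indexes first on the 1926-1929 base and then on the 1937 base.

```python
# Millions of dollars, from Table 63.
construction = {
    "new factory": {"1926-1929 avg": 640, 1937: 391, 1940: 337},
    "farm": {"1926-1929 avg": 468, 1937: 360, 1940: 360},
    "new nonfarm residential": {"1926-1929 avg": 4066, 1937: 1530, 1940: 2077},
}


def index_on_base(series, base_key, period):
    """Express the value for `period` as a percentage of the base value."""
    return round(100 * series[period] / series[base_key])


for name, series in construction.items():
    on_old_base = index_on_base(series, "1926-1929 avg", 1940)
    on_new_base = index_on_base(series, 1937, 1940)
    print(f"{name}: 1940 = {on_old_base} (1926-1929 = 100), "
          f"{on_new_base} (1937 = 100)")
```

The same absolutes thus yield 53, 77, and 51 on the earlier base but 86, 100, and 136 on the 1937 base.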

The data shown in Table 65 and the indexes presented in 
Tables 66 and 68 may also be used to illustrate the effect of the 
base selected. In Table 66 and Fig. 133, the year 1926 is the 
base and the prices of coffee, canned peaches, and wheat are 
each set at 100 in that year; subsequent years are indexed 
accordingly. From 1926 to 1933 the greatest decline occurred 
in the price of coffee, the next greatest occurred in the price of 
canned peaches, and the decline in the price of wheat was 
comparatively the least. Their relative recovery was in the same 
order, and all three were below normal in 1941. 

Fig. 133. Indexes of prices of wheat, canned peaches, and coffee. 1926 = 100. 

But if it is considered that they were at normal levels in 1941 
so that all are called 100 for that year, a quite different picture 
is obtained, as shown in Table 68 and Fig. 134. If these prices 


Table 68. Price Relatives of Coffee, Canned Peaches, and Wheat 
1941 = 100 

Item               1926     1933     1934     1941
Coffee             228               122      100
Canned peaches     130                92      100
Wheat              151                94      100


were normally related to each other in 1941, then in 1926 the 
price of coffee was more than twice normal, the price of wheat 
was well above 50 per cent over normal, and the price of canned 
peaches was 30 per cent over normal. Moreover, Fig. 134 and 
Table 68 seem to indicate that it was the price of canned peaches 
that was farthest below normal in 1933; the price of coffee was 
only slightly below normal. 



Fig. 134. Indexes of prices of coffee, canned peaches, and wheat. 1941 = 100. 

For most comparisons, a year too remote in the past is not a 
desirable base. For a long time, 1913, or an average of the 
years 1909-1914, was looked upon as the best base period to use, 
because it was the last normal period before the First World 
War. The farm bloc in Congress continued as late as 1941 to 
insist that farm prices should be permitted to rise to the par 
that existed before the First World War; but in 1941-1942, as 
farm prices began to rise at a more rapid rate than other prices 
so that they passed parity, the farm bloc began to insist upon a 
new definition of parity. The long survival of 1909-1914 as a 
base illustrates, not the general desirability of having a remote 
base period, but merely the power of the farm bloc. Ordinarily, 
general economic change over a 28-year period is sufficiently 
great to make such a base undesirable. 

In the 1920’s, accordingly, most comparisons came to be made, 
not with prewar 1913, but with the average of 1923-1925 or with 
the single year 1926; these years persisted as a base period much 
longer than might ordinarily be expected because the extreme 
decline of the depression of the early 1930’s made it difficult to 
select a new base period. Finally, however, as the years of the 
Second World War passed, the period immediately preceding it 
came to be regarded as the best base for current comparisons. 
In the early 1940’s the average for the years 1935-1939, or one 
of those years, began to be adopted as the base period.¹ 

Relative Parts of a Whole. A single absolute quantity is often 
divided into several parts, and these several parts are expressed 
as percentages or proportions of the whole. These are properly 
called, not "index numbers," but simply "relatives," although 
they could be referred to as “constituent relatives.” The term 
index numbers, used with strict propriety, refers to a series of 
relatives that is a composite of a more or less large number of 
series of relative numbers. The series of relatives may be com- 
bined to form a series of index numbers by any one of a number 
of methods of aggregating or averaging, as will be explained 
later in this chapter. Accordingly, in strict usage when an 
index is an average of relatives, the term relative should be 
reserved for the separate ingredients and the term index should 

¹ In the Survey of Current Business, Vol. 22 (November, 1942), the index 
of prices received by farmers was still reported on the base of the average of 
1909-1914 prices, the index of wholesale prices was still based on the 1926 
average, and the index of retail prices was based on the average for 1923-1925; 
but the cost-of-living index was based on the average 1935-1939, 
and the indexes of the purchasing power of the dollar (wholesale, retail, 
and farm) were based upon the 1935-1939 average. The indexes of national 
income and industrial production were on the average 1935-1939 base. 
The indexes of some manufacturing data, such as orders, shipments, and 
inventories, were based upon the average for the single year 1939. The 
Survey of Current Business, Vol. 22 (December, 1942), published the Bureau 
of Labor Statistics indexes of wage-earner employment and weekly wages in 
manufacturing industries, revised, with the average of the year 1939 as 
the base. 

be used for the composite. Yet this distinction is often honored 
in the breach as well as in the observance, and the student must 
expect to find the term index used in place of relative. 

An important item to remember in the use of constituent 
relatives is that a relative increase or decrease does not neces- 
sarily mean an absolute increase or decrease in the subgroup. 
The absolute of the subgroup may, indeed, move in the opposite 
direction from that indicated by the relative figures. Con- 
stituent relatives are useful when it is required to see clearly 
the relative changes. If absolute changes are desired, the raw 
data must be examined. Table 69 is an example of the use of 
constituent relatives. It reveals the necessity of attention to the 
absolute as well as the relative figures. 

Table 69. Death Rates per 100,000 Policyholders from Selected Causes 
Weekly premium-paying industrial business, Metropolitan Life Insurance Company 

                              Annual rate per 100,000*     Percentage distribution
                                                           of specified causes
Specified causes of death     1940     1941     1942       1940      1941      1942

All                           531.6    553.9    501.6      100.00    100.00    100.00
Diabetes mellitus              31.1     33.8     30.2        5.85      6.10      6.02
Appendicitis                    8.6      7.6      5.4        1.62      1.37      1.08
Influenza and pneumonia        74.5     79.5     47.2       14.01     14.35      9.41
Tuberculosis (all forms)       44.9     44.0     41.9        8.45      7.94      8.85
Syphilis                       12.4     11.0     10.0        2.33      1.98      1.99
Cancer (all forms)            102.1    103.8    102.2       19.21     18.74     20.37
Diseases of the heart         233.3    245.9    236.7       43.89     44.39     47.19
Motor-vehicle accidents        17.2     20.2     21.3        3.24      3.65      4.25
Suicides                        7.5      8.1      6.7        1.41      1.46      1.34

Source: Metropolitan Life Insurance Company, Statistical Bulletin, March, 1942, p. 11. 
* Per 100,000 policyholders, based upon first 3 months of each year. 


In order to illustrate the necessity of presenting the absolute 
figures as well as relative figures when constituent relatives are 
used, Table 70 is drawn up with a few items taken from Table 69. 
Study of the percentage distribution shown in Table 70 would 
appear to indicate that the death rate from suicides increased 
between the years 1940 and 1942. Actually, it decreased. 
Merely its relative position became more important. The percentage 
distribution, in the absence of attention to the absolute 
figures, would also lead to a tendency to exaggerate the rise in 
the death rate from automobile accidents. These misleading 
results are due to the change in the size of the totals for the 
respective years considered: from 107.8 in 1940 to 115.4 in 
1941 and to 80.6 in 1942. 

Table 70. Death Rate per 100,000 Policyholders from Selected Causes 
Weekly premium-paying industrial business, Metropolitan Life Insurance Company 

                              Annual rate¹                 Percentage distribution
                                                           of specified causes
Specified causes of death     1940     1941     1942       1940      1941      1942

All                           107.8    115.4     80.6      100.00    100.00    100.00
Influenza and pneumonia        74.5     79.5     47.2       69.11     68.89     58.56
Appendicitis                    8.6      7.6      5.4        7.97      6.59      6.70
Suicides                        7.5      8.1      6.7        6.96      7.02      8.31
Motor-vehicle accidents        17.2     20.2     21.3       15.96     17.50     26.43

¹ Per 100,000 policyholders, based upon first 3 months of each year. 


Especially when a small number of rates are being considered, 
as in Table 70, it is necessary to study both the rates and the per- 
centage distribution. Actually, the study of rates is required to 
answer the question: Is the rate from suicides greater in 1942 
than in 1941? Study of the percentage distribution of specified 
causes is required to answer the questions: In 1942 were 
motor-vehicle accidents a more important cause of death than influenza 
and pneumonia combined? Did motor-vehicle accidents become 
relatively more important from 1940 to 1942 as compared with 
the other specified causes? Important questions are answered 
by each of the sets of figures; what is necessary to avoid is the 
use of the wrong set of figures to answer a given question. 
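
The distinction between the two sets of figures can be illustrated with a brief Python sketch based on the 1940 and 1942 rates of Table 70. The function name percentage_distribution and the dictionary layout are assumptions made for the example; the totals are computed from the four listed rates.

```python
# Annual death rates per 100,000 policyholders, from Table 70.
rates = {
    "Influenza and pneumonia": {1940: 74.5, 1942: 47.2},
    "Appendicitis": {1940: 8.6, 1942: 5.4},
    "Suicides": {1940: 7.5, 1942: 6.7},
    "Motor-vehicle accidents": {1940: 17.2, 1942: 21.3},
}


def percentage_distribution(rates, year):
    """Express each cause's rate as a percentage of the total for that year."""
    total = sum(by_year[year] for by_year in rates.values())
    return {cause: round(100 * by_year[year] / total, 2)
            for cause, by_year in rates.items()}


share_1940 = percentage_distribution(rates, 1940)
share_1942 = percentage_distribution(rates, 1942)

# The suicide rate fell, yet its share of the specified causes rose,
# because the total of the four rates fell by more.
print(rates["Suicides"][1940], rates["Suicides"][1942])    # 7.5 6.7
print(share_1940["Suicides"], share_1942["Suicides"])      # 6.96 8.31
```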

Great Variety of Simple Index Numbers in Use. Hundreds of 
simple index numbers are in use, and the number has been 
increasing rapidly since the First World War. Indexes of the 
simple type illustrated in Tables 63, 65, and 68 exist for nearly 
every separate industrial activity, for thousands of prices, for 
retail sales, wholesale sales, inventories, consumption of certain 
types of goods, and for many other things related to economic 
and social activity. Index numbers measuring production 
from month to month in a large list of industries have been com- 
piled and published by the Board of Governors of the Federal 
Reserve System and other agencies. Indexes of marketing of 
fish, dairy products, livestock, wool, and poultry and eggs have 
been compiled by the Bureau of Foreign and Domestic Commerce; 
and indexes of the marketing of cotton, fruits, grains and vege- 
tables, lumber, and other natural products are compiled by the 
same bureau. This bureau has also compiled and published a 
large number of simple relative figures for new orders and unfilled 
orders in a number of manufacturing industries, including iron 
and steel, paper, lumber, textiles; and another series of index 
numbers of commodity stocks of manufactured goods and of raw 
materials, such as chemicals, foodstuffs, metals, textile materials, 
and rubber products. These indexes are published currently in 
the Survey of Current Business by the United States Department 
of Commerce. In the League of Nations publications, indexes of 
world stocks of foodstuffs and certain raw materials are available. 

The United States Department of Commerce has recently 
begun the compilation and publication of indexes of transpor- 
tation for the United States. These monthly indexes include a 
combined index of all types of transportation, commodity and 
passenger, and also indexes by types of transportation, such as 
an index of air transportation and a combined index of intercity 
motorbus and truck transportation. The indexes are published 
monthly with the base period 1935-1939 = 100 and appear in 
the Survey of Current Business.¹ This publication contains 
other illustrations of the many uses of index numbers. 

The use of either subordinate or coordinate relatives to aid 
in the interpretation of series of data does not involve the appli- 
cation of the theory of statistics or the principles of sampling, 
although the gathering of the raw data may have involved the 
use of the latter. The rules of comparability must be considered 
when numbers are converted into relatives, however, as indi- 
cated in the discussion above. When a whole series of these 
simple index numbers, or relatives, are combined into a com- 
posite index number, it is necessary to make application of the 
theory of statistics. The principles of stratified sampling apply 
to the construction of these composite index numbers. 

¹ The transportation indexes are described in the Survey of Current 
Business, Vol. 22 (September, 1942), pp. 20-28; Vol. 23 (May, 1943), pp. 
26-27. 

Index Numbers. Application of Sampling Technique. Index 
numbers are combinations of a large number of single series of 
relatives by some method of aggregating or averaging. In the 
field of prices, indexes of farm prices, of cost of living, of retail 
prices, of wholesale prices, of wages, and of exchange rates are 
some of the index numbers obtainable. Also, indexes of indus- 
trial production, of trade activity, of retail trade, and of 
employment are found in various sources. All these indexes are 
combinations of numerous series of relatives. 

From consideration of the various purposes for which index 
numbers may be used, it should immediately be apparent that a 
difficulty is involved. How, for example, is it possible to get 
together all the facts in the United States regarding all whole- 
sale prices from time to time, or all wages, or all retail prices, or all 
production or consumption activities? The answer, of course, 
is that it is not possible, or certainly not feasible, but that a 
sample of some kind must be used. When a composite is made 
up of several series, how shall they be weighted? Should they 
be considered of equal importance, and if not how shall their 
relative importance be determined in making up the composite? 
It is upon the basis of the principles of sampling that such 
index numbers are justified. As Prof. Edgeworth once said, 
the task is to extricate from fallible observations a mean apt 
to represent the general trend of prices, wages, production, or 
whatever is being measured.¹ 

The demonstration by eighteenth- and nineteenth-century 
statisticians, such as Süssmilch and Quételet, that a hitherto 
unsuspected regularity lay hidden in numerical data about 
social phenomena encouraged economists and social scientists 
in the belief that known variations that had been measured might 
be fair samples of the more numerous unknown variations. 
Furthermore, the construction of a great variety of composite 
index numbers by different investigators using different methods 
has produced results of such consistency as to inspire confidence 
in their use.² 

¹ Cf. Mitchell, Wesley C., Business Cycles: The Problem and Its Setting 
(1928), p. 204. 

² Cf. Mitchell, Wesley C., "Index Numbers of Wholesale Prices in the 
United States and Foreign Countries," Bureau of Labor Statistics, Bulletin 
284, p. 11. 

To grasp the significance of an index number, it is not sufficient 
to have reference only to the summary picture it presents. Just 
as in the case of an average of one frequency distribution, so in 
the case of index numbers, the distribution of cases is of great 
importance. An index number is really a series of averages 
based upon a series of frequency distributions (one frequency 
distribution for each time period), of which the index number 
itself is an average of some sort. A study based upon this idea 
was made by Wesley C. Mitchell in his analysis of year-to-year 
fluctuations of the prices recorded in the wholesale price bul- 
letins of the Bureau of Labor Statistics, covering prices from 
1891 to 1918 and including 232 to 348 commodities. He found 
that the price changes from year to year formed a fairly sym- 
metrical frequency distribution each year, and hence he con- 
cluded that "when it can be shown that phenomena are 
distributed approximately in this fashion, their average can safely 
be accepted as a significant measure of the whole set of variations, 
since even the deviations from the average are then grouped in a 
tolerably definite and symmetrical fashion about the average."¹ 

Such an analysis seemed to establish as satisfactory the use 
of an average to summarize price change from year to year; but 
index numbers frequently extend over a considerable period of 
time so that the general level of wholesale prices of 1942, for 
example, is compared with 1926 as a base or with 1935-1939. 
Year-to-year fluctuations may occur in a manner such that the 
average may be used to summarize; but what of change com- 
pared with some year more remote in the past? In order to 
test the reliability of the method of index-number construction in 
this regard, Prof. Wesley C. Mitchell applied the technique of 
taking several samples; in one sample he took 242 commodities, 
in another 60 commodities, and in a third sample 25 commodities 
at wholesale prices and constructed three sample index numbers 
for the period 1890-1913. He found that the results from the 
smaller samples were strikingly close to those of the larger 
sample.² 

¹ Ibid., pp. 17-18. 

² Ibid., p. 38. The theory of sampling errors does not apply in a way 
that makes possible mathematical tests from a single sample. 

Stratified Sampling Method Applied. Others have also found 
that whenever the principles of stratified sampling have been 
followed in the construction of index numbers of wholesale 
prices, the results obtained are similar to the results obtained by 
the use of all available data. This inspired confidence in index 
numbers extending back through the years, for which fewer price 
series are available in published records, and at the same time 
increased the belief that such an average expression of prices in 
the form of index numbers is a valid summary picture of general 
price change. 

It is upon the basis of the principle of stratified sampling that 
it is possible to measure, by index numbers, such things as the 
cost of living, or the volume of production, or the general whole- 
sale price level. Also, it is upon the basis of the theory of 
sampling that credence can be given to index numbers; in 
addition, it is due to this very fact that it is necessary to examine 
the constituent parts of an index number to be sure that it 
measures what it purports to measure and that it is applicable to 
any particular problem for which it is desired to use an index 
number. 

It is necessary to notice that stratified sampling is applied to 
the making of index numbers. For example, take the problem 
of measuring general price movement. This is not a case in 
which there is an infinite number of items, although the universe 
is a very large number, and the number for which data are given 
is probably less than the number for which data are unavailable, 
particularly in the case of retail prices or wages. In the case of 
wholesale prices the available data cover a larger portion of the 
universe. 

Not only is there not an infinite number of items, but the 
number of available items is often not a very large one. For 
example, some index numbers are based upon less than 50 
individual index-number series. However, the universe from 
which the items are taken is one concerning which a priori 
knowledge exists. According to such a priori knowledge, a 
representative sample can be obtained by a conscious or delib- 
erate proportional selection of items from the various known 
strata of the universe. For example, it is known, in the case of 
wholesale prices, that the universe is made up of prices of foods, 
prices of metals, prices of forest products, prices of various raw 
materials, prices of semimanufactured products in a number of 
fields that can be classified, and prices of final goods at whole- 
sale, to enumerate a few of the known strata of this universe. 
Knowing that such strata exist in the universe, the sample can 
be made proportional by a deliberate stratified random sampling 
procedure that would ensure proper representation in the sample 
of all the various strata known to exist in the universe.¹ 
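
The proportional selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the strata names follow the text, but the number of price series in each stratum, the sample size of 50, and the function names are hypothetical and do not come from any official index.

```python
import random

# Hypothetical universe: stratum -> number of available price series.
# These counts are illustrative assumptions, not figures from the text.
universe = {
    "foods": 200,
    "metals": 100,
    "forest products": 50,
    "raw materials": 150,
}


def proportional_allocation(strata_sizes, sample_size):
    """Split sample_size among strata in proportion to stratum size."""
    total = sum(strata_sizes.values())
    exact = {name: sample_size * size / total for name, size in strata_sizes.items()}
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = sample_size - sum(alloc.values())
    # Give any remaining units to the strata with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc


def stratified_random_sample(strata_sizes, sample_size, seed=0):
    """Draw item numbers at random within each stratum, using the allocation above."""
    rng = random.Random(seed)
    alloc = proportional_allocation(strata_sizes, sample_size)
    return {name: sorted(rng.sample(range(strata_sizes[name]), n)) for name, n in alloc.items()}


allocation = proportional_allocation(universe, 50)
sample = stratified_random_sample(universe, 50)
print(allocation)
```

With these assumed counts each stratum supplies one-tenth of its series, so the sample mirrors the composition of the universe.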

Variety of Purposes of Index Numbers. As a historical propo- 
sition, the original all-pervading purpose of an index number 
was to measure general exchange value, that is to say, to explain 
the relationship between prices, in their general or average move- 
ment, and the value of money and credit. 

At the present time, however, a large number of general 
indexes of prices and other phenomena are currently published; 
but few even of the general price indexes purport to be a measure 
of the value of money. General indexes of retail prices, indexes 
of wages and pay-roll totals, indexes of prices of farm products, 
metal products, manufactured goods, and raw materials, as 
well as general wholesale prices, are now available. Which of 
these price indexes really measures the value of money? 

Some statisticians and economists have held that a real 
measure of the changes in the value of money and credit should 
include, not only wholesale prices, but also wages, rent, and 
other prices, including retail prices and perhaps the prices of 
securities. Samples of each kind of price should be included in 
the index of prices that aims to measure general exchange value. 
On this theory, Carl Snyder, at that time statistician for the 
Federal Reserve Bank of New York, compiled an index of 
general price level, including wholesale and retail prices, wages, 
rents, etc., but for certain reasons he excluded security prices. 
After careful study, these various components were given certain 
weights in the general composite. It should be pointed out that 
this index of general price level was originated for the special 
purpose of deflating data on bank clearings. Since bank clear- 
ings included payments for all these things, Snyder believed that 
an index of prices based upon these components could be used to 
cancel out that part of change in total bank clearings due to 
price change and obtain thereby an index of physical volume of 
trade. Even if it is granted that this index of general price level 
is valid as a deflator of bank clearings, it still remains a question 
whether or not it measures the exchange value of money. 

¹ Cf. King, W. I., Index Numbers Elucidated, especially pp. 64-66. This, 
of course, often turns out to be a counsel of perfection in practice. The 
principle is based upon the assumption that in each of the strata designated 
the available data can be sampled successfully at random; and in practice 
this is often not true. For illustration, in gathering prices for an index of 
wholesale prices such subgroups of prices, or strata, as sulphuric acid and 
Portland cement are standardized, while house furnishings are not. From 
the point of view of obtaining the best possible results with the minimum 
amount of price gathering, and presumably with limited funds for the 
purpose, it would be sound practice to abandon the counsel of perfection 
and spend less money gathering prices of standardized articles and more 
gathering prices of nonstandardized articles. The resulting disproportionate 
amount of prices in the respective subgroups can then be countered by the 
required adjustments in the weights used to combine the series of relatives 
into index numbers. 

It could be argued with considerable force that such a general 
measure is impracticable because of the difficulty of getting 
adequate samples of rents, for example. And in any case, such 
a general measure of prices does not really give the measure of 
change in the purchasing power of money. The general pur- 
chasing power of money may be a far more flexible and possibly 
sensitive factor than this general price average would indicate. 
A general price average would include an overweight of prices 
largely controlled by custom, or of prices in which resistance to 
change is very great for some other reason, as, for example, 
because of public regulation, taxation, or their indirect effects. 
The true measure of change in general exchange value may 
be more nearly approximated by the wholesale price index and 
perhaps even by the group of more sensitive wholesale prices. 

It is not the purpose here to carry this argument to a conclusion 
but merely to suggest its unsettled state. It may be significant 
that the Bureau of Labor Statistics has published reciprocals of 
its several indexes of prices — wholesale, retail, cost-of-living, and 
farm products — as indexes of the purchasing power of the dollar 
in those respective fields. The question of how to measure 
general exchange value, or the purchasing power of money, con- 
tinues to be a controversial one. Meanwhile, index numbers 
continue to serve enormously useful special purposes whether 
or not collectively or individually they measure general exchange 
value. In his Treatise on Money, J. M. Keynes appears to 
suggest that the exchange should be looked upon as a number of 
relatively noncompeting groups of markets and that there may 
be no such thing as a general purchasing power of money.¹ 
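
The reciprocals mentioned above, by which a price index is turned into an index of the purchasing power of the dollar in the same field, amount to a one-line calculation. A minimal Python sketch follows; the function name and the index values fed to it are hypothetical illustrations, not published figures.

```python
def purchasing_power(price_index, base=100.0):
    """Reciprocal of a price index, scaled so that the base period equals `base`."""
    return base * base / price_index


# Hypothetical price index values on a base of 100.
for value in (100.0, 50.0, 125.0):
    print(value, purchasing_power(value))
```

When the price index stands at 50 the dollar buys twice as much as in the base period, and when it stands at 125 the dollar buys four-fifths as much. Whether such a reciprocal measures general exchange value is, as the text notes, a separate and unsettled question.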

In the light of such theoretical difficulties hindering the 
proper measurement of the purchasing power of money by using 
reciprocals of price indexes, recent attempts have been made to 
construct indexes of the purchasing power of money by other 
means. One notable contribution is the index of purchasing 
power constructed by Murray Shields; this combines monetary 
data, viz., demand deposits, foreign deposits in the United 
States, foreign bank deposits in Federal reserve banks, volume of 
money in circulation, and cash in the vaults of commercial 
banks.² 

Construction of Index Numbers. Principal Methods. Prof. 
Irving Fisher of Yale University, in a comprehensive study 
of the mathematics of index-number making, found several 
hundred kinds of formulas for calculating index numbers; but 
it is quite unnecessary to be disturbed by this fact, since as 
he himself says, only a few of them are of any value. There are 
two principal methods of calculating index numbers now in use 
and generally recognized as adequate for most purposes, but 
other methods are occasionally used and will therefore be 
described. The most commonly used are (1) the weighted 
average-of-relatives method and (2) the weighted aggregative 
method. Other methods sometimes used are (3) the simple 
average-of-relatives method and (4) the simple aggregative 
method. Various alternative ways of applying these methods 
are possible. For example, in the case of the simple average of 
relatives, sometimes the median is used instead of the arith- 
metical mean in order to avoid extreme variations; it is advisable 
to use the median especially for very small samples. These 
methods will be taken up in the order (3), (1), (4), and (2), 
which is the logical method of treating them, rather than in the 
order of their prevalence in use, which is that given above. 

¹ Cf. also Beckhart, B. H., The New York Money Market, Vol. 2, and 
King, op. cit., pp. 189-216. 

² "A Measure of Purchasing Power Inflation and Deflation," Journal of 
the American Statistical Association, Vol. 36 (1940), pp. 461-470. 

Simple Average-of-relatives Method. Referring again to the 
simple case of the prices of coffee, wheat, and canned peaches, 
already used, perhaps the first method that would suggest itself 
to anyone desiring to obtain a summary figure representing 
average price change would be to add up the relatives and divide 
by their number, as follows: 

Table 71. Composite Index Number of the Prices of Coffee, Wheat, 
and Canned Peaches 
1926 = 100 

Commodity         1926                      1933                     1941
Coffee            \(p_0/p_0 = 100\)         \(p_1/p_0 = 43\)         \(p_2/p_0 = 44\)
Canned peaches    \(p_0'/p_0' = 100\)       \(p_1'/p_0' = 58\)       \(p_2'/p_0' = 77\)
Wheat             \(p_0''/p_0'' = 100\)     \(p_1''/p_0'' = 48\)     \(p_2''/p_0'' = 66\)
Average           300/3 = 100               149/3 = 50               187/3 = 62

The resulting composite index number shows that on the 
average these three prices fell to 50 in 1933 in comparison with 
100 in 1926 and then rose on the average to 62 in 1941 in com- 
parison with 100 in 1926. Reducing this method to symbols, 
let \(p_0, p_1, p_2\) represent the prices of coffee; 
\(p_0', p_1', p_2'\) represent the prices of canned peaches; 
\(p_0'', p_1'', p_2''\) represent the prices of wheat. 

The relatives that appear in Table 71 are thus shown also in 
symbols. For example, the ratio \(p_2''/p_0''\) corresponds to 66; in 
these symbolical presentations the multiple 100 is always "under- 
stood" and not actually written in the formula. The averages 
for the three would be expressed by symbols in Table 72: 


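Table 72 

1926:  \(\dfrac{\frac{p_0}{p_0} + \frac{p_0'}{p_0'} + \frac{p_0''}{p_0''}}{3}\)

1933:  \(\dfrac{\frac{p_1}{p_0} + \frac{p_1'}{p_0'} + \frac{p_1''}{p_0''}}{3}\)

1941:  \(\dfrac{\frac{p_2}{p_0} + \frac{p_2'}{p_0'} + \frac{p_2''}{p_0''}}{3}\)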

These averages are represented by the letter P, and when N 
commodity prices are averaged, instead of only three, for n years, 
instead of only three, the series of averages of relatives is repre- 
sented symbolically as follows: 


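\[ P_{00} = \frac{\sum \frac{p_0}{p_0}}{N}, \quad P_{01} = \frac{\sum \frac{p_1}{p_0}}{N}, \quad \ldots, \quad P_{0n} = \frac{\sum \frac{p_n}{p_0}}{N} \qquad (1) \]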


The capital N refers to the number of prices, and the small sub- 
script n refers to the number of years, or number of time periods, 
which might be months or weeks as well as years. In general, 
the subscripts to the series of p's represent the time periods, and 
0 is assigned to the base time period, at which the relative equals 
100. The average of the relatives likewise equals 100 in the 
base time period. The primes refer to different commodities. 
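
As a check on the arithmetic of formula (1), the simple average of relatives can be computed in a short Python sketch. The relatives are those of the coffee, canned-peach, and wheat illustration (1926 = 100); the function names are chosen here for illustration only. The median variant, mentioned earlier as an alternative for very small samples, is included for comparison.

```python
from statistics import median

# Price relatives (1926 = 100) for 1926, 1933, and 1941.
years = [1926, 1933, 1941]
relatives = {
    "coffee": [100, 43, 44],
    "canned peaches": [100, 58, 77],
    "wheat": [100, 48, 66],
}


def simple_average_of_relatives(relatives_by_commodity):
    """Formula (1): add the relatives for each period and divide by N."""
    n = len(relatives_by_commodity)
    return [sum(period) / n for period in zip(*relatives_by_commodity.values())]


def median_of_relatives(relatives_by_commodity):
    """Alternative: take the median relative for each period."""
    return [median(period) for period in zip(*relatives_by_commodity.values())]


mean_index = simple_average_of_relatives(relatives)
median_index = median_of_relatives(relatives)
for year, mean_value, median_value in zip(years, mean_index, median_index):
    print(year, round(mean_value, 1), median_value)
```

Rounded to whole numbers, the arithmetic means reproduce the averages of 100, 50, and 62.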

Weighted Average-of-relatives Method. The simple average of 
relatives involves the assumption that changes in the several 
prices to be combined are of equal importance; but this may not 
be true. Consequently, the idea of weighting the component 
price relatives in accordance with weights that are considered to 
reflect their relative importance has been developed. 

The weights are commonly based upon some rational con- 
sideration such as the quantities consumed in a given represen- 
tative year, the quantities produced, family budget figures, or 
some other criterion. Suppose, after considering all available 
information on the subject, changes in the price of a pound of 
coffee are considered thrice as important as changes in the price 
of a dozen cans of peaches and changes in the price of wheat per 
bushel are judged twice as important as changes in the price of a 
pound of coffee. Convenience of calculation will be attained 
if the numbers used as weights are so arranged that they will 
sum up to 1, 10, or 100, because the averaging process will then 
be a simple matter of changing decimal points in the sum of the 
weighted relatives. Such a manipulation of the quantities 
representing weights will have no effect on the final answer and 
will reduce the amount of work considerably if the problem is a 
long one involving, say, several years of monthly indexes. In 
the illustration used above, the weights are as follows: 

Coffee, 3 = \(w\) 

Canned peaches, 1 = \(w'\) 

Wheat, 6 = \(w''\) 

A weighted average-of-relatives index number of these three 
commodities would be calculated as illustrated in Table 73. 


Table 73. Index Number of the Prices of Coffee, Wheat, and 
Canned Peaches 
Weighted average of relatives, 1926 = 100 

Commodity           1926               1933               1941
Coffee              100 X 3 = 300      43 X 3 = 129       44 X 3 = 132
Canned peaches      100 X 1 = 100      58 X 1 =  58       77 X 1 =  77
Wheat               100 X 6 = 600      48 X 6 = 288       66 X 6 = 396
Weighted average    1000/10 = 100      475/10 = 47.5      605/10 = 60.5


In symbolic language, the weighted average of relatives illus- 
trated in Table 73 is as follows: 


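\[ P_{00} = \frac{\sum w \frac{p_0}{p_0}}{\sum w}, \quad P_{01} = \frac{\sum w \frac{p_1}{p_0}}{\sum w}, \quad P_{02} = \frac{\sum w \frac{p_2}{p_0}}{\sum w} \qquad (2) \]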
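
The weighted average of relatives can be verified in the same way. The following Python sketch uses the relatives and the weights 3, 1, and 6 of the illustration; the function name is chosen here for illustration only.

```python
# Price relatives (1926 = 100) for 1926, 1933, and 1941, with the weights of the illustration.
years = [1926, 1933, 1941]
relatives = {
    "coffee": [100, 43, 44],
    "canned peaches": [100, 58, 77],
    "wheat": [100, 48, 66],
}
weights = {"coffee": 3, "canned peaches": 1, "wheat": 6}


def weighted_average_of_relatives(relatives_by_commodity, weights):
    """Multiply each relative by its weight, add, and divide by the sum of the weights."""
    total_weight = sum(weights.values())
    n_periods = len(next(iter(relatives_by_commodity.values())))
    return [
        sum(weights[name] * series[t] for name, series in relatives_by_commodity.items())
        / total_weight
        for t in range(n_periods)
    ]


weighted_index = weighted_average_of_relatives(relatives, weights)
for year, value in zip(years, weighted_index):
    print(year, value)
```

Because the weights sum to 10, the division is only a shift of the decimal point, which is the convenience of calculation the text recommends.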


Instead of weighting by arbitrary weights, the actual quan- 
tities of the articles consumed or produced in the base year are 
sometimes used as weights, if such data are available. The 
quantities of the base year or base period are retained through- 
out, instead of getting the new quantities each year or each time 
period, for two reasons: (1) because it is difficult if not impossible 
to get quantity figures for every year and (2) because the pro- 
portions between these quantities are not likely to change 
greatly over short periods of time. If, after a given base period 
has been used for some time, it is discovered that one or several 
of the quantity weights are at variance with current conditions 
that seem to be likely to persist, the system of weights may be 
revised. In the various index numbers it constructs, the Bureau 
of Labor Statistics keeps continually on the watch for such 
changing conditions and when desirable changes the weighting 
system. 

The symbols for quantity weights are series of q's, as follows: 

\(q_0\) = quantity of coffee in 1926 
\(q_1\) = quantity of coffee in 1933 
\(q_2\) = quantity of coffee in 1941 
\(q_0'\) = quantity of canned peaches in 1926 
\(q_1'\) = quantity of canned peaches in 1933 
\(q_2'\) = quantity of canned peaches in 1941 

Wheat would be the same arrangement of a series of \(q''\)'s. The 
resulting index number, using base-year quantities as weights for 
the relatives averaged, would be as follows: 






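\[ P_{01} = \frac{\sum q_0 \frac{p_1}{p_0}}{\sum q_0}, \quad \ldots, \quad P_{0n} = \frac{\sum q_0 \frac{p_n}{p_0}}{\sum q_0} \qquad (3) \]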


Simple Aggregative Method. As suggested by its name, a 
simple aggregative index is the sum of the absolute prices, 
without first changing them to relatives. Thus the raw 
prices of coffee, canned peaches, and wheat for 1926, then for 
1933, and then for 1941 would be added together to give 
the index. This seems to be combining nonhomogeneous 
things, and it is; nevertheless, there is one famous and at one 
time widely used index that was based upon this method. 
Such was Bradstreet’s index of wholesale prices, which continued 
in use for many years. Following is an illustration of the method 
of Bradstreet's index of wholesale prices:¹ 


Prices, Dollars per Pound 
0.0007   Connellsville coke, southern coke 
0.001    Bituminous coal, brick, iron ore 
0.002    Anthracite coal 
0.003    Salt 
0.004    Bessemer pig iron 
 . . . 
0.34     Alcohol 
0.50     Australian wool 
0.52     Quicksilver 
0.84     Rubber 
9.8530   The sum, which is the index 

According to this method, the index number of prices does 
not assume the form of a relative, but appears as follows: 

Table 74. Bradstreet's Index 

1933         Index       1934         Index
October      $9.0512     January      $8.8329
November      8.8480     February      9.0110
December      8.8126     March         9.2627

¹ A good description of Bradstreet's index is contained in W. C. Mitchell, 
"Index Numbers of Wholesale Prices in the United States and Foreign 
Countries," Bureau of Labor Statistics, Bulletin 284, pp. 161-165. Other 
price indexes are also discussed in that source, such as Dun's, Gibson's, 
the Annalist, War Industries Board, Federal Reserve Board, and the 
Bureau of Labor Statistics. 

The index can readily be converted into a series of relatives 
upon any chosen base; the Survey of Current Business published 
Bradstreet's index converted into relatives, with the monthly 
average of 1926 = 100 until November, 1937, when compilation 
of the index was discontinued.¹ 
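
The conversion of such a dollar aggregate into relatives can be sketched in Python. The six monthly values are those of Table 74, read as October to December, 1933, and January to March, 1934. Since the 1926 monthly average is not given here, the sketch takes October, 1933, as the base purely as an assumption for illustration; the function name is likewise hypothetical.

```python
# Bradstreet's index (sum of prices per pound, in dollars), as read from Table 74.
bradstreet = [
    ("October 1933", 9.0512),
    ("November 1933", 8.8480),
    ("December 1933", 8.8126),
    ("January 1934", 8.8329),
    ("February 1934", 9.0110),
    ("March 1934", 9.2627),
]


def to_relatives(series, base_value):
    """Express each aggregate as a percentage of the aggregate for the base period."""
    return [100.0 * (value / base_value) for _, value in series]


# Assumed base: October 1933 = 100 (the published relatives used the 1926 monthly average).
base_value = bradstreet[0][1]
bradstreet_relatives = to_relatives(bradstreet, base_value)
for (month, _), relative in zip(bradstreet, bradstreet_relatives):
    print(month, round(relative, 1))
```

Changing the base only rescales the series; the month-to-month proportions are unaffected.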

Little rational justification can be mustered to the defense of 
such an index as Bradstreet's, except that it worked well. Using 
approximately 96 commodities, it gave an index number that 
reflected accurately the changes in wholesale prices, as tested 
by more elaborately conceived and compiled indexes of 
wholesale prices later introduced into the field. Bradstreet's 
index was the pioneer in the history of price indexes in the 
United States, having been started in 1897. The conversion of 
all prices into prices per pound gives the effect of a concealed 
weighting, but no logical basis can be found for such a system of 
weighting. The symbolic expression of this index is as follows: 

\[ \sum p_0, \quad \sum p_1, \quad \sum p_2, \quad \ldots, \quad \sum p_n \qquad (4) \]

When reduced to relatives and some base is taken as 100, it is 
as follows: 


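\[ P_{00} = \frac{\sum p_0}{\sum p_0}, \quad P_{01} = \frac{\sum p_1}{\sum p_0}, \quad \ldots, \quad P_{0n} = \frac{\sum p_n}{\sum p_0} \qquad (5) \]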


While the concealed weighting system of Bradstreet's index is 
accidental, or haphazard, depending upon the units in which 
goods are quoted, it has the effect of making the high-priced 
articles dominant. Its success as a good index of price change 
was due to the fact that there was a skillful or at least a pro- 
pitious use of stratified sampling in the selection of the prices 
used. 

¹ Cf. Survey of Current Business, Supplement, Vol. 18 (1938), p. 168. 
Monthly figures for the index are available from 1903 and annual figures 
from 1890. See 1932 Supplement, pp. 28-29 and 1936 Supplement, p. 15. 
Also see Bureau of Labor Statistics, Bulletin 173 (July, 1915). 

Weighted Aggregative Method. In making index numbers by 
the aggregative method, it is usually considered that weights 
are required, for the same reason that they are regarded as 
necessary in constructing an index number by the average-of- 
relatives method. The most reasonable kind of weight would 
seem to be the quantities of the several commodities produced or 
consumed or marketed. Such figures have become increasingly 
available since the time when such indexes as Bradstreet's were 
originally conceived and developed. 

The last four or five decennial censuses of the United States 
have included more and more complete data on physical quan- 
tities of production and, more recently, data on retail and 
wholesale trade; and, in the years since the First World War, 
yearly figures have been available on physical quantities of goods 
in stock and physical production of some goods, through the 
activities of the United States Departments of Commerce and 
Agriculture. If it is assumed that the method of weighting is 
one that uses actual quantity figures, there are two methods of 
weighting the price aggregates in order to construct the index 
number. The first method is called "weighting by base-year 
quantities." The second method is called "weighting by given- 
year quantities." 

The desirability of weighting by base-year quantities has a 
twofold explanation: (1) In spite of the increased availability 
of quantity figures, there are still many commodities for which 
quantity figures are not easily available for every year; but a 
large number of such quantity figures, classified so as to be 
useful for weighting purposes, are available for the census years. 
(2) With few exceptions, the proportional changes in the quan- 
tities or value weights from year to year are not sufficiently great 
to cause large errors if these proportions are assumed to remain 
constant for several years in succession. Adjustments in the 
quantity or value weights can be made in the case of rapidly 
growing or rapidly declining industries, but the necessity for such 
changes within 10-year periods will not include a very large 
number of commodities. As a purely practical matter, the 
choice of base-year weighting instead of given-year weighting 
gives adequate results with much less statistical calculation, as 
well as much less statistical research in seeking data for use 
as weights. 
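
The two ways of weighting a price aggregate can be compared in a short Python sketch. It assumes the usual form of the weighted aggregative index, in which each price is multiplied by a quantity, the products are summed for the given year and for the base year, and the ratio of the two sums is taken. All prices and quantities below are hypothetical figures invented for illustration.

```python
# Hypothetical prices (dollars per unit) and quantities; none of these figures come from the text.
p0 = {"coffee": 0.50, "canned peaches": 2.00, "wheat": 1.50}  # base-year prices
p1 = {"coffee": 0.25, "canned peaches": 1.50, "wheat": 0.75}  # given-year prices
q0 = {"coffee": 10, "canned peaches": 2, "wheat": 20}          # base-year quantities
q1 = {"coffee": 12, "canned peaches": 3, "wheat": 18}          # given-year quantities


def weighted_aggregative(p_given, p_base, quantities):
    """Ratio of the weighted price aggregates, expressed on a base of 100."""
    numerator = sum(p_given[name] * quantities[name] for name in quantities)
    denominator = sum(p_base[name] * quantities[name] for name in quantities)
    return 100.0 * numerator / denominator


base_year_weighted = weighted_aggregative(p1, p0, q0)   # weighting by base-year quantities
given_year_weighted = weighted_aggregative(p1, p0, q1)  # weighting by given-year quantities
print(round(base_year_weighted, 1), round(given_year_weighted, 1))
```

With these assumed figures the two results differ by little more than one point, which illustrates why fixed base-year weights are often an adequate and far less laborious choice.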

United States Bureau of Labor Statistics. Construction of 
Index Illustrated by Practice of an Official Bureau. In the 
United States, the Bureau of Labor Statistics is one of the 
most important official compilers of index numbers of various 
kinds. From its publications can be illustrated how the various 
matters discussed above are brought into practice and how 
diligent must be the researcher, how alert the statistician, to 
new problems of weighting, sampling, and the like. 

In 1943 the Bureau of Labor Statistics of the United States 
Department of Labor was compiling and publishing weekly, 
monthly, and annual index numbers of wholesale commodity 
prices. In a revision made in 1927, when the base period was 
changed from the 1913 average to the 1926 average, a new weight- 
ing system was adopted; it was then decided to revise the quan- 
tities used as weighting factors every 2 years, as the results of 
each new biennial census of manufactures became available. 
At the same time, the number of price series was increased from 
404 to 550. Another revision was made in 1931, when the num- 
ber of price series was changed from 550 to 784 and some rear- 
rangement of the items in the groups and subgroups was made. 
No change was made in 1931 in the method of calculating the 
indexes. In December, 1942, according to the Survey of Current 
Business, the monthly index of wholesale prices compiled by the 
Bureau of Labor Statistics was made up of 889 quotations.¹ 

¹ Survey of Current Business, Vol. 22 (December, 1942), p. S-3. On the 
history of its compilation, weighting, etc., see Bureau of Labor Statistics, 
Bulletin 181, 415, 453, 521; "Wholesale Prices," Serial No. R1434 (Decem- 
ber, 1941); "Revised Method of Calculation of the Wholesale Price Index 
of the United States Bureau of Labor Statistics," Serial No. R.666. 

The weights used for farm products are based on averages 
for 3-year periods, changed every 2 years in order to keep the 
weights up to date. Thus, for the years 1932 and 1933, the 
weights used for farm prices were based upon averages of quan- 
tities marketed in the years 1927, 1928, and 1929; and for the 
years 1934 and 1935 the weights used for farm-products prices 
were based upon averages of quantities marketed in the years 
1929, 1930, and 1931. For all other groups of commodity prices, 
the weights used are averages of quantities produced for sale, to 
which has been added the average of imports for consumption, in 
the last two completed census periods. For example, for the 
years 1932 and 1933, the weights were based on average census 
data (plus imports for consumption) for the years 1927 and 1929, 
whereas for the years 1934 and 1935 the weights were based on 
average census data (plus imports for consumption) for the years 
1929 and 1931. In cases where census data are lacking, esti- 
mates are made of the quantities of the various commodities 
marketed, based on the best information available from govern- 
mental and reliable private sources; and these estimates are used 
as weighting factors. Commodities are added or dropped from 
time to time as they become important or cease to be important 
in the markets.¹ 

During the period of depression following 1932, when the 
data on manufactured output became violently disrupted, the 
weights based upon averages of the years 1929-1931 were 
retained. Most of the prices continued in 1943 to be weighted 
by the averages of the 1929-1931 census data, but for certain 
commodity groups new weights had begun to be based upon 
special studies of those groups. Thus, in April, 1941, the Bureau 
published a study of the "Wholesale Price Trends of Carpets and 
Rugs," revising its price series for this group. This study 
included new weights for the prices in this group, according to 
their "importance in the country's markets in 1939." 

The quantity weight used for each of the series, the unit in 
which each is priced, and the 1939 value of each item expressed 
as a percentage of the aggregate value of all carpet and rug items 
in the Bureau's indexes are shown in Table 75. 

Table 75 

Price series                                  Unit in which priced    Weight 
Axminster carpets                             Lineal yard              7,077 
Axminster 9 x 12 rugs                         Each                     2,015 
Plain velvet carpets (lineal-yard series)     Lineal yard              6,424 
Plain velvet carpets (square-yard series)     Square yard              7,861 
Wilton 9 x 12 rugs                            Each                       612 

Source: Bureau of Labor Statistics, mimeographed publication, "Wholesale Price Trends 
of Carpets and Rugs," April, 1941, pp. 10-17. 

The use of data for 1939 departed from the "general practice 
of using the 1929 and 1931 data for weighting in the Bureau's 
wholesale price indexes," in order to provide weights for the 
individual items that reflected their relative importance more 
nearly in accordance with present-day sales. The Axminster 
type has long been the most popular. The relative importance 
of Wilton carpets and rugs has increased considerably since the 
depression of the early 1930's, and they have regained much of 
their earlier popularity. Prior to the depression and before 
plain velvets became popular, the importance of Wiltons, on a 
dollar basis, was almost as great as that of Axminster carpets 
and rugs. During the depth of the depression, when consumer 
incomes were greatly reduced, there was a lessened demand for 
Wiltons, apparently because they were much more expensive, 
on the average, than Axminsters. 

^ If a "price" index, as contrasted with a "realized price" index, is 
desired, it is necessary to keep the weights constant. In constructing a 
realized price index the weights may be changed, but in revising the weights 
the index must be calculated by using both sets of weights for the overlapping 
year or period when the change is made. By "realized price" is 
meant the dollars covering the transaction, divided by the units involved in 
the transaction. When the lack of continuity of specifications makes it 
hard to define the commodity, as with automobiles, the dollars of sales for 
each general type (sedans, coupes, etc.) divided by the number of such 
units, in other words, the "realized price," has received the endorsement 
of competent price experts as an acceptable quotation to use in price 
statistics. 
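
The footnote on realized price indexes states that, when weights are revised, the index must be calculated with both sets of weights for the overlapping year. A minimal sketch of how such an overlap can be used to link the two series follows. All figures are hypothetical and chosen only for illustration, and the linking by a simple ratio in the overlap year is an assumption about how the two calculations would be joined, not a description of the Bureau's own procedure.

```python
# Hypothetical index values; none of these figures come from the Bureau.
old_weights_series = {1937: 100.0, 1938: 96.0, 1939: 98.0}   # computed with old weights
new_weights_series = {1939: 100.0, 1940: 103.0, 1941: 112.0}  # computed with new weights

overlap_year = 1939  # the year calculated with both sets of weights

# Ratio that puts the new-weight series on the level of the old one.
link = old_weights_series[overlap_year] / new_weights_series[overlap_year]

# Keep the old series through the overlap year, then continue with the
# new-weight series scaled by the link ratio.
spliced = dict(old_weights_series)
for year, value in new_weights_series.items():
    if year > overlap_year:
        spliced[year] = value * link

for year in sorted(spliced):
    print(year, round(spliced[year], 2))
```

Because the overlap year is computed both ways, the change of weights by itself introduces no jump in the linked series.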

The study of the carpet and rug price series is presented to 
illustrate the alertness of the Bureau in relation to the problem of 
compiling and publishing its indexes of wholesale prices. Its 
activity extends to other groups of price series as well. For 
example, beginning with January, 1938, the results of a survey 
of farm-machinery wholesale prices were incorporated for the 
first time in the Bureau of Labor Statistics general indexes of 
wholesale prices. In 1941 the Bureau began publishing weekly 
index numbers of waste and scrap materials, carrying the index 
back to January, 1939. In "Wholesale Prices" (June, 1941), 
the Bureau published a monthly index of standard machine-tool 
prices, including 11 types of standard nonspecialty machine 
tools, carrying the index back to January, 1937. These new 
indexes are calculated on August, 1939, as a base; the monthly 
index of wholesale prices continued in 1943 to be based on the 
average of 1926 as 100. 
Method of Computation Illustrated. The weighted aggregative 
method of computing an index number of prices is illustrated in 
Table 76, using only five price series. The five price series 
selected for illustration are those for carpets and rugs, for which 
the Bureau's weights are shown in Table 75. A procedure 
similar to that illustrated in Table 76 is used by the Bureau, 
but with 889 price quotations instead of 5. 


Table 76. Work Sheet Illustrating Calculation of Weighted 
Aggregative Index 
Base-period weights 

                                              Average price       Weights      Weighted price 
Commodity                                   1935-1939    1941                1935-1939     1941 
                                               p0         p1         q0        p0q0        p1q0 
(1)                                            (2)        (3)        (4)        (5)         (6) 
Axminster carpets                             1.567      2.014      7,077     11,090      14,253 
Axminster 9 x 12 rugs                        22.745     27.936      2,015     45,831      56,291 
Plain velvet carpets (lineal-yard series)     1.772      2.356      6,424     11,383      15,135 
Plain velvet carpets (square-yard series)     2.581      3.266      7,861     20,289      25,674 
Wilton 9 x 12 rugs                           40.007     50.521        612     24,484      30,919 
Σ =                                                                           113,078     142,272 
× (100/Σp0q0)                                                                = 100.00    = 125.82 

P01 = Σp1q0 / Σp0q0 

Source: Cutts, Jesse M., and Samuel J. Dennis, "Revised Method of Calculation of 
Wholesale Price Indexes," Journal of the American Statistical Association, Vol. 32 (1937); 
also reprinted by the Bureau of Labor Statistics, Serial No. 666. 


Accordingly, the index number of the wholesale prices of 
carpets and rugs is 100.00 for the years 1935-1939, the base 
period, and 125.82 for 1941. The latter figure is obtained by 
taking $P_{01} = \sum p_1 q_0 / \sum p_0 q_0$. From the system of symbols already 
introduced, the symbolic presentation of this form of index 
number is as follows: 

$$P_{00} = \frac{\sum p_0 q_0}{\sum p_0 q_0}, \quad P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0}, \quad P_{02} = \frac{\sum p_2 q_0}{\sum p_0 q_0}, \quad \ldots, \quad P_{0n} = \frac{\sum p_n q_0}{\sum p_0 q_0}$$

This is illustrated in the figures of Tables 75 and 76, the 1941 
index being $\frac{142{,}272}{113{,}078} \times 100 = 125.82$. 
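
The calculation of Table 76 can be retraced in a few lines of code. The sketch below uses the prices and weights printed in the table; the variable and function names are illustrative choices only, and the two plain velvet series are labeled by their pricing unit as a matter of convenience.

```python
# Rows of Table 76: (series, p0 = 1935-1939 average price,
#                    p1 = 1941 price, q0 = base-period weight)
table_76 = [
    ("Axminster carpets",                  1.567,  2.014, 7077),
    ("Axminster 9 x 12 rugs",             22.745, 27.936, 2015),
    ("Plain velvet carpets (lineal yard)", 1.772,  2.356, 6424),
    ("Plain velvet carpets (square yard)", 2.581,  3.266, 7861),
    ("Wilton 9 x 12 rugs",                40.007, 50.521,  612),
]

def weighted_aggregative_index(rows):
    """Return 100 * sum(p1*q0) / sum(p0*q0) for base-period weights q0."""
    base_aggregate = sum(p0 * q0 for _, p0, _, q0 in rows)
    given_aggregate = sum(p1 * q0 for _, _, p1, q0 in rows)
    return 100.0 * given_aggregate / base_aggregate

index_1941 = weighted_aggregative_index(table_76)
print(round(index_1941, 2))
```

The denominator is the same for every year, which is the "constant divider" advantage of base-year weighting: only the numerator has to be recomputed when a new year's prices arrive.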
Price Indexes and Quantity Indexes. Aggregative Index Using 
Given-year Weights. If given-year quantity weights are available 
and used for computing an index, it must be noted that the 
following system of ratios would merely give an index of changing 
aggregate values, without distinguishing which part of the 
change is due to price change and which part to quantity change: 

$$R_{00} = \frac{\sum p_0 q_0}{\sum p_0 q_0}, \quad R_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0}, \quad R_{02} = \frac{\sum p_2 q_2}{\sum p_0 q_0}, \quad \ldots, \quad R_{0n} = \frac{\sum p_n q_n}{\sum p_0 q_0}$$

Such an index is an index of aggregate value, made up partly of 
changes in quantity and partly of changes in price. In order, 
therefore, to extract from it that part of the change which is due 
solely to price change, the base-year prices must be multiplied 
throughout by the given-year weights. This fact makes the 
given-year weighting method a very long one to calculate; it loses 
the advantage, inherent in the aggregative index weighted by 
base-year quantities, of having a constant divider in securing the 
index. In addition, the method of weighting by given-year 
quantities necessitates two sets of cross products for each 
year: each year's quantities multiplied by that year's prices 
and by the base-year prices. Following is the symbolic 
expression of the aggregative index of prices weighted by given-year 
quantities: 


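$$P_{00} = \frac{\sum p_0 q_0}{\sum p_0 q_0}, \quad P_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_1}, \quad P_{02} = \frac{\sum p_2 q_2}{\sum p_0 q_2}, \quad \ldots, \quad P_{0n} = \frac{\sum p_n q_n}{\sum p_0 q_n} \qquad (7)$$
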

Index of Quantities Weighted by Prices. An advantage of the 
given-year weighting method is that an index of quantities 
weighted by prices can be obtained as a by-product, with com- 
paratively little additional calculation. The same numerators as 
those used in Eq. (7) can be used to calculate an index of quan- 
tity change weighted by given-year prices. For each year, 
given-year prices are multiplied by base-year quantities, and 
these aggregates are used as dividers. This will give an index of 
quantity weighted by given-year prices, as follows: 

$$Q_{00} = \frac{\sum p_0 q_0}{\sum p_0 q_0}, \quad Q_{01} = \frac{\sum p_1 q_1}{\sum p_1 q_0}, \quad Q_{02} = \frac{\sum p_2 q_2}{\sum p_2 q_0}, \quad \ldots \qquad (8)$$



Unfortunately, this advantage in the given-year weighting 
method is largely imaginary because the quantity data are not 
available soon enough for short periods of time to make it practicable 
to construct monthly or weekly indexes. In any case, it is 
also possible to obtain a quantity index weighted by base-year 
prices, using the following equation, which would provide a 
much simpler method: 

$$Q_{00} = \frac{\sum p_0 q_0}{\sum p_0 q_0}, \quad Q_{01} = \frac{\sum p_0 q_1}{\sum p_0 q_0}, \quad Q_{02} = \frac{\sum p_0 q_2}{\sum p_0 q_0}, \quad \ldots \qquad (9)$$



Quantity indexes are constructed, however, by other methods, 
usually with more general application of stratified sampling and 
with other weights than prices, largely because of the difficulty 
of obtaining quantity data. Not only are these other methods 
more convenient to calculate, but they make it possible to 
handle matters having to do with weighting and bias in the 
results. In using equations like Eqs. (8) and (9), it is often very 
difficult to appraise the inaccuracies due to bias inherent in the 
method. 
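
The relation among the value index and the price and quantity indexes of Eqs. (7), (8), and (9) can be checked numerically. The sketch below takes the base-period prices, 1941 prices, and base-period weights from Table 76; the given-year quantities `q1` are hypothetical, invented only so that the cross products can be formed, since the text supplies no given-year quantities.

```python
# Prices and base-period weights from Table 76.
p0 = [1.567, 22.745, 1.772, 2.581, 40.007]   # 1935-1939 average prices
p1 = [2.014, 27.936, 2.356, 3.266, 50.521]   # 1941 prices
q0 = [7077, 2015, 6424, 7861, 612]           # base-period quantities

# Hypothetical given-year quantities (an assumption, not source data).
q1 = [7500, 2100, 6000, 8200, 700]

def aggregate(prices, quantities):
    """Sum of cross products, e.g. the aggregate of p times q."""
    return sum(p * q for p, q in zip(prices, quantities))

value_index = aggregate(p1, q1) / aggregate(p0, q0)            # aggregate value
price_base_weights = aggregate(p1, q0) / aggregate(p0, q0)     # base-year weights
price_given_weights = aggregate(p1, q1) / aggregate(p0, q1)    # Eq. (7)
quantity_given_prices = aggregate(p1, q1) / aggregate(p1, q0)  # Eq. (8)
quantity_base_prices = aggregate(p0, q1) / aggregate(p0, q0)   # Eq. (9)

# The value index splits into a price part and a quantity part in two ways.
print(round(100 * value_index, 2))
print(round(100 * price_base_weights * quantity_given_prices, 2))
print(round(100 * price_given_weights * quantity_base_prices, 2))
```

All three printed figures agree, which shows why the value index alone cannot say how much of the change is due to price and how much to quantity: each split pairs a price index with a quantity index that uses the opposite period for its weights.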

Quantity Indexes and Business Barometers. Indexes of 
Quantity of Trade or Production. Several statisticians and 
economists made attempts, especially in the years immediately 
following the First World War, to construct an index that would 
trace variations in the physical volume of production or trade. 
Pioneer efforts to construct such indexes, based upon scant 
material and with little in the way of a statistical theory to 
guide them, were made before the First World War by Wesley C. 
Mitchell, Irving Fisher, and Edwin W. Kemmerer. During the 
war and postwar period important progress was made, especially 
by Edmund E. Day, Warren M. Persons, and others. In 1923, 
the latter published an index of trade for the United States, 
beginning with the year 1903.* The index of production is 
based very heavily on the index of employment; it might there- 
fore fail to reflect properly the results of technological advance. 


* "An Index of Trade for the United States," Review of Economic Statistics, 
Preliminary Vol. 5 (April, 1923), pp. 71-78. Cf. also Garfield, Frank R., 
"General Indexes of Business Activity," Federal Reserve Bulletin, Vol. 26 
(1940), pp. 495-501. 

After these experiments by pioneering individuals, several 
government agencies and privately financed research organizations 
took up the task of developing indexes of trade and production. 
The most widely known and now most currently used index of 
industrial production for the United States is that compiled and 
regularly published in the Federal Reserve Bulletin by the Research 
Division of the Board of Governors of the Federal Reserve Sys- 
tem. This index is compiled from 95 individual series of monthly 
data, representing about 85 per cent of the total industrial pro- 
duction of the United States. The series include 22 durable- 
goods manufacturing industry series, 63 nondurable-goods 
manufacturing industry series, and series representing production 
of fuels and metals. This index is also regularly reproduced in 
the Survey of Current Business, published by the United States 
Department of Commerce.^ A reproduction of the entire index, 
with its component parts, 1923-1940 by months, with the average 
1935-1939 = 100 as the base, can be found in the Federal 
Reserve Bulletin.^ 

It is characteristic of the indexes of physical volume of produc- 
tion or trade that they consist of combinations of various series 
upon the basis of stratified sampling, the weights for the repre- 
sentative series being devised upon a priori knowledge concern- 
ing the importance of certain groups of activity in relation to the 
whole of business activity. These indexes treat the separate 
series statistically before putting them together. For example, 
they remove seasonal variation and trend from the separate 
series and thus average together the cycles of the various separate 
series into the composite. The method of averaging employed 
is generally the aggregative method, although since 1940 the 
Federal Reserve Board uses an average of relatives weighted by 
quantities so that the final result is equivalent to what would be 
obtained by using the weighted aggregative method.® 

^ Federal Reserve Bulletin, Vol. 13 (February, March, 1927), Vol. 17 
(February, September, 1931), Vol. 18 (March, 1932); for adjustments made 
necessary by the 1942 world war, see Vol. 27 (1941), pp. 878-881; cf. Survey 
of Current Business, Vol. 20 (1940), pp. 11-17. 

* Vol. 26 (1940), pp. 825-882; see also "Answer to Critics of the Index," 
pp. 1047-1049. See also Woodlief, Thomas, and Maxwell R. Conklin, 
"Measurement of Production," Federal Reserve Bulletin, Vol. 26 (1940), 
pp. 912-924. 

® See Woodlief and Conklin, op. cit. 
Computation of Weights Illustrated. The index of manufactures 
published by the Federal Reserve Board is weighted by the 
total value added by manufacture in the case of all manufacturing 
industries, and the index of mineral production is weighted 
by value of mineral products. The sum of these two is the 
index of production. The individual production series of which 
the manufacturing index is composed are weighted, as nearly as 
possible, according to the same principle.^ Accordingly, the 
total value added by manufacturing industries in 1937, as 
reported by the United States census, was distributed among the 
16 groups represented in proportion to the value added for each 
group, and then derived group totals were subdivided among 
industries and finally among individual products in a similar 
manner. Each individual series is thus assigned a hypothetical 
value-added figure, which is then divided by the relative of the 
1935-1939 compared with 1937 in order to convert 1937 value-added 
figures to 1935-1939 base. The derived 1935-1939 figures 
for each series are then expressed as percentages of their own 
total to obtain the weights. These percentages represent the 
estimated relative importance of each series in the 1935-1939 
base period and are the weights applied to the relatives in 
combining them into the index of production. Table 77 reproduces 
a summary of these weights. 

Table 77. Relative Importance of Industry Groups and Selected 
Industries Included in the Federal Reserve Board Index 
of Industrial Production 
(Per cent of total with 1937 weights) 

Series                                          1937 weights 
Industrial production                              100.00 
  Manufactures                                      84.80 
    Durable manufactures                            37.93 
      Iron and steel                                11.00 
      Machinery production                          10.81 
      Transportation equipment                       5.92 
      Nonferrous metals and their products           2.81 
      Lumber and its products                        4.39 
      Stone, clay, and glass products                3.00 
    Nondurable manufactures                         46.87 
      Textiles and their products                   11.22 
      Leather and its products                       2.28 
      Manufactured food products                    10.92 
      Alcoholic beverages                            1.84 
      Tobacco products                               1.24 
      Paper and its products                         3.13 
      Printing and publishing                        6.44 
      Petroleum and coal products                    2.14 
      Products of chemicals                          6.27 
      Rubber products                                1.39 
  Minerals                                          15.20 
    Fuels                                           13.01 
    Metals                                           2.19 

Source: Condensed from Federal Reserve Bulletin, Vol. 26 (1940), p. 910. 

^ For industry series in which census data on value added by manufacture 
are not available other criteria had to be used, such as total value of manufactured 
product, raw materials consumed, or man-hours worked. See ibid., 
pp. 917-918. 

Using the weights shown in Table 77, the equation for the 
Federal Reserve Board’s index of industrial production is as 
follows: 

in which 

$$w = \frac{p_{37} q_0}{\sum p_{37} q_0}$$

and $p_{37}$ represents the value (or value added) per unit of output in 
the weight-base period. 

Barometers, or Indexes, of General Business Conditions. Some 
composite indexes purport to be barometers or indexes of business 
and trade in general. These indexes are of two types: (1) A 
single series is sometimes believed to be a barometer of general 
business conditions and (2) a number of indexes of trade activity 
are combined in order to measure general business conditions. 

Of the first type, the most prominent one at present is probably 
the index of electrical-power production, which is compiled 
from quantity figures published by the Geological Survey. The 
index of activity in the steel industry at one time was looked 
upon as a good barometer of general business conditions because 
so many industries are dependent upon steel or steel products. 
The trends in the average of security market prices are sometimes 
taken as a barometer of coming business conditions, or at least 
as a measure of existing conditions. In wartime, the security 
markets often reflect conditions and war information that are 
not generally publicized. 
A good example of the second type of index of general business 
conditions is that published currently by the New York Times and 
formerly by the Annalist. This index has been widely used 
and until October, 1937, was reproduced in the Survey of Current 
Business, published by the United States Department of 
Commerce. 

A more comprehensive example of this second type of index 
of business conditions is one that has evolved from the pioneer 
work of Carl Snyder, whose procedure was based upon the 
theory that the fluctuations in total bank clearings are made up 
of two variables: (1) price change and (2) change in physical 
volume of trade. By constructing a price deflator, which has 
already been discussed, and then by using this deflator to cancel 
out from aggregate bank clearings that part due to price changes, 
he sought to obtain an index of physical volume of trade for 
the years 1875-1924.^ Modifications and refinements were made 
in the construction of this index by Leroy M. Piser, so that it 
was known as the Snyder-Piser index of volume of trade for the 
United States. It included 89 series, classified as follows: pro- 
ductive activity, 46 series; primary distribution, 13 series; 
distribution to consumer, 8 series; financial activity, 6 series; 
general (such as life insurance, postal receipts, electrical-power 
corporations, farmers, and communication), 5 series; and finally 
debits outside New York City. Thus it came to be based upon 
the principle of stratified sampling. This index of volume of 
trade and production is published monthly by the Federal 
Reserve Bank of New York in its Monthly Review of Credit and 
Business Conditions.^ 

Various forecasting services compile their respective indexes of 
business conditions according to their particular interpretation 
as to what should best be included in such an index and how 
best to weight various factors. Carefully worked out indexes 
of the marketing of farm products and forestry products are now 
available as a result of the efforts of the Bureau of Agricultural 
Economics in the United States Department of Agriculture. 
These are reproduced in the Survey of Current Business. The 
Bureau of Foreign and Domestic Commerce also publishes 
indexes of domestic commodity stocks, and of world stocks of 
certain outstanding industries, which constitute good barometers 
of the related business conditions.^ 

1 Snyder, Carl, "A New Clearings Index of Business for Fifty Years," 
Journal of the American Statistical Association, Vol. 19 (1924), pp. 329-335. 

* Johnson, Norris O., "New Indexes of Production and Trade," Journal 
of the American Statistical Association, Vol. 33 (1938), pp. 341-348. 

ADJUSTMENT OF INDEXES TO BENCH MARKS 

Ideal Conditions for Stratified Sampling Nonexistent. In order 
to produce results approximating those of true random sampling, 
conditions favorable to random sampling in the subgroups or 
strata must exist. For one thing, this means that there must 
be large numbers of items from which to draw within each sub- 
group. It also means that the number of sample items must be 
sufficient to avoid the disadvantages of small samples. The law 
of large numbers must be given opportunity to produce the 
results of true random selection within each stratum; such is the 
sine qua non of truly successful stratified sampling. Under such 
conditions the method of selection causes no accumulation of 
bias. Such ideal conditions do not exist with reference to any 
known index number, not even the Bureau of Labor Statistics 
index of wholesale prices, which contains a total of more items 
than any other index. 

Nevertheless, the pattern of stratified sampling can with 
considerable advantage be adopted as the guide to procedure in 
the construction of indexes of all types. Following this pattern 
the investigator first works out a system of classification of the 
data for which an index is to be compiled. Using the subgroups 
of this classification, he can then proceed according to the prin- 
ciples of stratified random sampling so far as it is possible to do 
so. When he finds that conditions ideal for random sampling in a 
subgroup fail to exist, the investigator must resort to subjective 
means to secure results that he believes will be representative. 

Inasmuch as all indexes contain, in some part, data that have 
been collected and processed by the use of such subjective 
methods, employed in the absence of ideal sampling conditions, 
it is desirable wherever possible for the statistician to find bench 
marks with which he can compare the results of his sampling 
procedure. A common-sense appraisal of the over-all result is 
the most generally used bench mark to judge whether or not the 
results are satisfactory; but this method presupposes an unusual 
amount of a priori knowledge and of scientific critical judgment 
on the part of the statistician. Sometimes more objective bench 
marks may be found to aid the statistical worker along his 
thorny path. These will be illustrated in the ensuing sections. 

1 Survey of Current Business, Vol. 20, Annual Supplement (1940), pp. 
8a-164. For a more complete discussion of barometers of general business 
conditions see Wesley C. Mitchell, Business Cycles: The Problem and Its 
Setting (1928), pp. 291-330; Joseph L. Snider, Business Statistics; and 
Garfield, op. cit. 

Reasons Why Indexes Require Adjustment. The reasons why 
indexes require adjustment to bench marks do not necessarily 
arise from faulty application of the method of stratified sampling. 
They arise from the nature of the universe from which the sample 
is taken. In connection with most types of data collected for 
the construction of indexes, the universe is a discrete rather than 
a continuous one; in other words, the universe consists of com- 
paratively small numbers of units, each of which constitutes a 
comparatively large proportion of the whole universe. Often 
they cannot be considered as representative of each other. 

When such a comparatively small universe is subdivided, in 
order to apply the stratified sampling technique, the strata con- 
stitute universes with still smaller numbers. Added to this is 
the usual fact that only a portion of this remaining small num- 
ber is accessible to the data collector; in some cases, unavoidable 
bias itself constitutes a part of the reason for the accessible por- 
tion. Under such circumstances it is almost impossible to 
realize the essential condition of randomness of selection in the 
respective strata, and consequently stratified sampling technique 
gives less satisfactory results. 

Such is the situation with respect to sample data collected 
from business firms, especially manufacturing enterprises. In 
some of the subdivisions, corporate enterprise is on so large a 
scale that only a few firms represent a large portion of that 
stratum. In all subdivisions, the size of the sample return 
measures, not only changes in the trends it is desired to measure, 
but also success or failure of the collecting agency in persuading 
firms to report. The statistical technique of comparing iden- 
tical firms from month to month reduces but does not altogether 
obviate the cumulative error resulting from this weakness. 

In addition, growth of an industry, and hence growth in pay 
rolls, in output, in stocks of materials, or in whatever is the subject 
of investigation, occurs not only in existing firms; part 
of the growth is in the rise of new firms in the industry. In some 
strata, perhaps the steel and machinery industries, expansion or 
contraction of existing firms (and hence those reporting in a 
stratified sample) may accurately reflect proportionately a rise 
or fall in business in that stratum. But in other strata, like some 
branches of the textile industry or the food industry, the expansion 
or contraction of existing firms (and hence those reporting 
in a stratified sample) may not at all reflect proportionately the 
rise or fall in business. 

In heavy industry, where plant and equipment constitute a 
large proportion of the business investment, cyclical changes of 
the sample might well be much greater than cyclical changes in 
the universe. This would follow if the large firm, with heavy 
investment in plant and equipment, tended to curtail production 
instead of lowering prices when faced with declining business 
prospects. 

In some branches of the clothing and food industries, in which 
small investment in plant and equipment and large numbers 
of small firms predominate, cyclical changes may be smaller in 
the sample than in the universe. The birth of new firms or 
resumption of activity by old firms is the principal manner of 
expansion in such strata. The death of old firms is the principal 
means of decline. The reporting firms are quite likely to be the 
ones that would not die at a rate so rapid as the average rate in 
the industry. 

Circumstances like those just described constitute only an 
illustration of the type of problem facing the statistician, who 
must continually endeavor to improve the sample of reporting 
firms. Great as his efforts and ingenious as his imagination 
may be, the resulting sample is likely to show bias. 

For bench marks in connection with adjustment of indexes 
of employment and pay rolls, statisticians have made use of the 
successive issues of the census of manufactures, often called the 
Biennial Census of Manufactures, which appeared in 1914, 1919, 
and each odd year thereafter, including 1939. For years after 
1939 it should be possible to get similar bench-mark data from 
the records of the Social Security Administration. 

By using the census-of-manufactures data as bench marks, it 
has been possible to check up on the monthly or weekly sample 
results obtained by the sampling process and to adjust them for 
any bias that is disclosed by such a check. As each new set of 
manufacturing census data became available, which was about 
two years from the time it was taken, such indexes could be 
adjusted to the census data for errors that had accumulated since 
the last census. In the meantime, the results of the sample were 
relied upon as the best available information; and at the same 
time subjective sampling procedures were continually studied 
with a view to improvement wherever possible. To this end, 
the adjustment procedure often discloses areas in which the 
sampling results are especially in need of improvement. 

The method of making such adjustment will be illustrated, 
not so much as a valuable statistical device in itself, but as an 
example to the student of the care and attention to form, procedure, 
cross checking, and the like, required of a good statistician. 
Thus the following instructions, together with the form 
used, are presented as an exhibit to help the student visualize 
how a statistician plans his work and works his plan. 

Method of Adjustment Illustrated. A good example of an index 
adjusted to United States census bench marks is the monthly 
index of pay rolls and employment published by the Bureau of 
Labor Statistics. The method of adjustment is reproduced by 
permission of the Bureau of Labor Statistics and is applied to a 
monthly index of pay rolls in the metal stamping, enameling, 
and japanning and lacquering industry of New Jersey.^ The 
adjustment is carried out on Form BLS 1238, June, 1940, pre- 
sented here in Table 78. 

The raw data, which have been adjusted for the 1937 and 
earlier census figures or bench marks, but remain to be adjusted 
for 1939 census data, are entered by months in columns (3), 
(9), and (13); the sums and averages ($S$ and $\bar{I}$) for each of these 
columns are then entered. In column (17), using the lower part 
entitled "Formula if L is not available,"* enter the United 
States census figures for 1937 and 1939 ($Z_1$ and $Z_3$). Calculate the 
ratio $Z_3/Z_1$, and enter it in the space provided; in Table 78 this 
is 0.933280. Copy $S_1$ from column (3) into the space provided 
in column (17), that is, in Table 78, 883.82; this number multiplied 
by the ratio $Z_3/Z_1$ equals $S_3'$, entered in the space provided, 
in Table 78, that is, 824.85. 

^ The work on New Jersey data was done by a Works Progress Administration 
project sponsored by the New Jersey State Labor Department for the 
construction of monthly indexes of pay rolls and employment in manufacturing 
industries, January, 1923-December, 1940. One of the authors was 
called upon to serve as consultant and director of the project. 

* The part labeled "Formula if L is available" is used with a blanket 
adjustment method involving several census periods. 

In column (13) calculate $R_3$ by finding $\sum nI$ for the year 1939 
[the $n$ for each month is found in column (12)]. $\sum nI$ = January 
$nI$ + February $nI$ + March $nI$ + ... + December $nI$, including 
an $nI$ for each of the 12 months. 

In column (18) enter $S_3'$, copying it from the last row of 
column (17). Enter $S_3$, copying it from column (13). Subtract 
$S_3$ from $S_3'$ and enter the difference in the next row of column (18). 
Copy $R_3$ [from column (13)] in the next row of column (18). 
This value, $R_3$, divided into the figure in the preceding row, 
$S_3' - S_3$, gives the value of $d$. Enter $d$ in the last row of column 
(18). This is the adjustment parameter. It is now used to 
adjust the series by months as follows: 

In columns (4), (10), and (14) enter $1 + nd$ for each month. 
These values should be obtained on a calculating machine as 
follows: Put 1.000000 in the machine, and add it. Put $d$ on the 
keyboard, being careful to place it correctly for the decimal 
point. Subtract once, and record the answer for $1 + nd$ in 
January, 1937. Subtract twice more (making -3 altogether), 
and record the answer for $1 + nd$ in February, 1937. Subtract 
twice more (making -5 altogether), and record the answer for 
$1 + nd$ in March, 1937. Subtract once more (making -6 
altogether), and record the answer for $1 + nd$ in April, 1937. 
The values for $1 + nd$ in May, June, July, and August are the 
same as $1 + nd$ for April, March, February, and January, 
respectively. They can be found by reversing the above process 
on the calculator until 1.000000 remains in the machine and $d$ 
on the keyboard. For September add $d$ once and record $1 + nd$ 
for that month. Add $d$ four more times (making 5 altogether), 
and record $1 + nd$ for October. By following a similar procedure, 
guided by $n$ in columns (2), (8), and (12), values of 
$1 + nd$ are calculated and entered for each month through 
December, 1939. 

Enter in columns (5), (11), and (15) the indexes in columns 
(3), (9), and (13), multiplied, respectively, by the $1 + nd$ for the 
corresponding month in columns (4), (10), and (14). Add for 
each year, and enter sums, which equal $K$, $S_2'$, and $S_3'$. Divide 
each of these by 12, and enter the quotients in the next row. 
Thus, in Table 78, $K$ = 883.88, $S_2'$ = 654.19, and $S_3'$ = 824.84; 
and divided by 12 these become 73.66, 54.62, and 68.74. 

Table 78. Adjustment of Monthly Indexes to Biennial Census Figures 
Continuation method 
U.S. Bureau of Labor Statistics Form B.L.S. 1238, June, 1940 

The revised indexes appear currently in the monthly bulletin on Employment and Pay Rolls, Serial No. R589, and the back figures are published in 
full from 1919 to date in Federal Reserve Bulletin, Vol. 24 (1938), pp. 838-866. Cf. also recommendations of the Committee on Government Statistics 
in "Recent Progress in Employment Statistics" by Aryness Joy, Journal of the American Statistical Association, Vol. 29 (1934), pp. 355-371. 

In column (16) enter S1 and K, as indicated in the table;
subtract K from S1, and enter the difference. Divide this
difference by 40, and enter the quotient in the next row. This
figure is h, the parameter for the second adjustment. If S1 − K
is smaller than 0.05, regardless of sign, do not calculate h. In
the problem illustrated, S1 − K = −0.06; hence h is calculated
and found to be −0.002.

If h is calculated, enter in column (6), for each month, mh;
that is, in January enter h, in February enter 2h, in March enter
3h, in April enter 4h, in May to August, inclusive, 5h; thereafter
declining each month, with 2h in November and h in December.
Enter in column (7) the sum of the figures for the respective
months in columns (5) and (6). The sum of column (7) is equal
to S1. If h is not used, the sum of column (5) is taken as S1; that
is, if h is ignored, K = S1.
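The divisor 40 used for h is simply the sum of the twelve monthly multipliers m (1, 2, 3, 4, 5, 5, 5, 5, 4, 3, 2, 1). A short sketch of the second adjustment follows; the function and variable names are hypothetical, and rounding h to three decimals is an assumption made to match the printed value of −0.002.

```python
from decimal import Decimal, ROUND_HALF_UP

# Monthly multipliers m: rising to 5 in May-August, then declining to 1.
M_WEIGHTS = [1, 2, 3, 4, 5, 5, 5, 5, 4, 3, 2, 1]


def second_adjustment(diff, threshold=Decimal("0.05")):
    """Return (h, [m*h for each month]) for a given difference.

    If the difference is smaller than the threshold, regardless of sign,
    h is not calculated and (None, None) is returned.
    """
    if abs(diff) < threshold:
        return None, None
    h = (diff / sum(M_WEIGHTS)).quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)
    return h, [m * h for m in M_WEIGHTS]


# The difference in the problem illustrated is -0.06.
h, column_6 = second_adjustment(Decimal("-0.06"))
```

With the illustrated difference, `h` comes out as −0.002 and the May to August entries of column (6) as −0.010.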



CHAPTER XX 

RATIONAL BASIS OF THE ANALYSIS OF TIME SERIES 


Elements of Variation in Time Series. The elements of 
variation contained in an ordinary time series may be illustrated 
by building up a hypothetical time series. 


The first element in the time series is long-time growth, or
trend. People living in the twentieth century are accustomed
to the idea that things grow, or progress. Table 79, column (1),
shows years and months for 3 years, and column (2) shows a set
of figures that grow at the constant difference of 0.2 per month.
This column of figures is plotted in Fig. 135 (AA') and is a
picture of the growth, or trend, in the hypothetical time series.

Time series are also likely to have 
seasonal variations. Many economic and 
social phenomena vary from season to 
season in a similar manner each year. 
This is most evident in the case of 
activities affected by weather, such as 
agricultural production; but such patterns 
of seasonal variation occur in other events 
as well. Suppose the seasonal variation 
in the hypothetical time series is such that 
November is usually 58 per cent above 
the average month, July is usually only 
43 per cent as large as the average 
month, etc., as indicated in Table 80 
showing the index of seasonal variation 
for the hypothetical time series. 



Fig. 135. Two of the component parts of a hypothetical time
series. AA' = annual trend; BB' = assumed trend, modified by
annual seasonal variation.


Figure 136 is a graph of this seasonal variation as it occurs year
after year, 1943, 1944, and 1945.

Table 79. Hypothetical Time Series Built Up

     (1)            (2)          (3)          (4)          (5)
Year and month   Growth or    Seasonal     The cycle    The cycle
                   trend      variation                   put in
                               put in

1943:
  January           1.0         1.45         100         1.45
  February          1.2         1.49         101         1.50
  March             1.4         1.41         102         1.44
  April             1.6         1.33         103         1.37
  May               1.8         1.19         106         1.26
  June              2.0         1.04         109         1.13
  July              2.2         0.95         112         1.06
  August            2.4         1.01         115         1.16
  September         2.6         2.42         120         2.90
  October           2.8         4.00         122         4.88
  November          3.0         4.74         124         5.88
  December          3.2         5.02         125         6.28
1944:
  January           3.4         4.93         126         6.21
  February          3.6         4.46         130         5.80
  March             3.8         3.84         140         5.38
  April             4.0         3.32         150         4.98
  May               4.2         2.77         160         4.43
  June              4.4         2.29         180         4.12
  July              4.6         1.98         200         3.96
  August            4.8         2.02         210         4.24
  September         5.0         4.65         160         7.44
  October           5.2         7.44         140        10.42
  November          5.4         8.53         112         9.55
  December          5.6         8.79         111         9.76
1945:
  January           5.8         8.41         109         9.17
  February          6.0         7.44         107         7.96
  March             6.2         6.26         105         6.57
  April             6.4         5.31         103         5.47
  May               6.6         4.36         101         4.40
  June              6.8         3.54          99         3.50
  July              7.0         3.01          90         2.71
  August            7.2         3.02          85         2.57
  September         7.4         6.88          75         5.16
  October           7.6        10.87          65         7.07
  November          7.8        12.32          60         7.39
  December          8.0        12.56          55         6.91

Table 80. Seasonal Variation
(In percentages of the average month)

January      145      May        66      September     93
February     124      June       52      October      143
March        101      July       43      November     158
April         83      August     42      December     157








When the seasonal variation and trend are combined, a line
like BB' in Fig. 135 is produced; the data are shown in column
(3) of Table 79. To obtain each monthly value for the line
BB', each monthly coordinate of the line AA', that is, the growth
element in the time series, has been multiplied by the index of
seasonal variation for the corresponding month. This has the
effect of redistributing the total of the 12 monthly figures of the
growth line in such a manner as to make them properly reflect
the seasonal element. Thus the trend figure for January is
multiplied by 1.45 (or 145 per cent) while the April trend figure
is multiplied by 0.83 (or 83 per cent).

A third element of variation in time series is cyclical
fluctuation, which may extend over several years. For example,
Fig. 137 shows the rising and the falling movement of a cyclical
fluctuation by months that occurs over a period of 3 years; this
is shown also in column (4) of Table 79. In column (5) of
Table 79 and in Fig. 138 is shown the effect of combining also
the cyclical movement. The figures for the respective months
are now altered according to whether the cycle is carrying them
upward or downward, and the percentage figures for the cycle,
shown in column (4), depict this upward and downward swing of
the cycle. The cycle is put into the data by multiplying each
monthly figure in column (3) by the corresponding monthly index
of the cycle, expressed as a percentage, found in the same row of
column (4). The results are shown in Fig. 138, which is the final
hypothetical time series; the data for it are in column (5) of
Table 79.
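The construction of the hypothetical series can be retraced in a few lines. The sketch below is one way to do it, under stated assumptions: the function and variable names are hypothetical; each column is rounded to two decimals (halves rounded up) before the next multiplication, which is what the printed figures imply; and the June seasonal index is taken as 52, the value implied by column (3) of Table 79 (for example, 2.0 × 0.52 = 1.04).

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

# Seasonal index, January-December, in per cent of the average month.
# June = 52 is the value implied by column (3) of Table 79.
SEASONAL = [145, 124, 101, 83, 66, 52, 43, 42, 93, 143, 158, 157]

# Cycle index, column (4) of Table 79, January 1943 through December 1945.
CYCLE = [
    100, 101, 102, 103, 106, 109, 112, 115, 120, 122, 124, 125,
    126, 130, 140, 150, 160, 180, 200, 210, 160, 140, 112, 111,
    109, 107, 105, 103, 101, 99, 90, 85, 75, 65, 60, 55,
]


def build_series(seasonal, cycle, start=Decimal("1.0"), step=Decimal("0.2")):
    """Return rows of (trend, trend with seasonal, with cycle put in)."""
    rows = []
    for i, cycle_pct in enumerate(cycle):
        trend = start + step * i  # column (2): constant difference per month
        with_seasonal = (trend * seasonal[i % 12] / 100).quantize(
            CENT, rounding=ROUND_HALF_UP)  # column (3)
        with_cycle = (with_seasonal * cycle_pct / 100).quantize(
            CENT, rounding=ROUND_HALF_UP)  # column (5)
        rows.append((trend, with_seasonal, with_cycle))
    return rows


rows = build_series(SEASONAL, CYCLE)
```

Here `rows[0]` corresponds to January, 1943, and `rows[35]` to December, 1945.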



Fig. 136. Seasonal variation in the hypothetical time series. See
Table 80.

Two important effects of combining the growth element and
the seasonal-variation element are noticeable from Fig. 135.
In the first place, the combination has a tendency to obscure the
trend. It is still clear in line BB' that there is a rising tendency,
but the wide sweeps of the seasonal fluctuations tend to conceal
the exact nature of the rise; for without the line AA' in Fig. 135 it
would be difficult to visualize precisely what the slope of this
trend actually is. In the second place, the combination definitely
distorts the shape of the seasonal variation, in two ways: (1) It
causes the valleys and peaks to be thrown out of line
arithmetically. (2) It minimizes the size of the seasonal variation
where trend is low and exaggerates the size of the seasonal
variation where trend is high.

Fig. 137. The cycle in the hypothetical time series. See Table 79,
column (4).

Fig. 138. All three component elements of the hypothetical time
series combined. See Table 79, column (5).

From Fig. 138 it is clear that the effect of including the
cyclical movement is further to obscure the trend or growth
element and to distort still more the character of the seasonal
variation. It is in approximately this condition that most time
series exist in their raw state. Raw data of time series contain
in varying degrees elements of all three of these types of
fluctuation. Some have little seasonal variation, some have a
great deal, and some have none. Many have rising trends following
population and general growth, while a few have declining
trends because they represent decaying or disappearing types
of economic or social activities. In practically all time series,
cycles of varying length and varying amplitude occur.

In addition to the three elements illustrated by the
hypothetical case, most time series contain fluctuations due to
unusual or residual occurrences, such as the effects of floods,
storms, or strikes. This gives four elements or types of
fluctuation, and these four types serve as a good classification for
an empirical start in the analysis of time series.¹

GENESIS AND PURPOSES OF THE TIME-SERIES ANALYSIS 

The hypothetical problem just illustrated consisted in a 
synthesis. The study of time series is analysis — a reversal of the 
procedure that has just been demonstrated. This breaking up 
of time series into its constituent elements, and the various com- 
plications involved, constitutes the subject of time-series analysis. 
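Because the hypothetical series was built by multiplication, the reversal can be illustrated by division. The sketch below takes one month from Table 79 and divides out the cycle and seasonal indexes to recover the trend figure. The function name is hypothetical, and the example is artificially easy: here the indexes are known in advance, whereas in actual analysis they must first be estimated from the data.

```python
def remove_components(value, seasonal_pct, cycle_pct):
    """Divide out the cycle and seasonal indexes (both in per cent)."""
    return value / (cycle_pct / 100) / (seasonal_pct / 100)


# December, 1945: final figure 6.91, cycle index 55, seasonal index 157.
trend_dec_1945 = remove_components(6.91, seasonal_pct=157, cycle_pct=55)
```

The result is close to the trend value of 8.0 in column (2); the small discrepancy comes from the two-decimal rounding in the table.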

Why do economists, social scientists, and statisticians analyze 
time series? What started them along this line of procedure, 
and what are its advantages? The answers to these or other 
questions as to the significance of time-series analysis have in 
general a threefold basis: (1) interest in the population problem 
and the discovery of the law of organic growth, (2) concern for 
the general problem of the so-called “business cycle,” and (3)
preoccupation with the variety of problems associated with
seasonal influences upon business and social life.

Rational Trends. Historical Background. In 1798, Thomas
Robert Malthus, a minister of the gospel and a political
economist, wrote an Essay on the Principle of Population, in which he
advanced the fundamental principle that the law of growth of 
population is geometric — population, he said, tends to grow in a 
geometric progression. The curve representing population 

¹ This is the conventional classification of types of fluctuation that occur
in time series; it was presented in detail by W. M. Persons of the Harvard
Committee of Economic Research and published in the Review of Economic
Statistics, Preliminary Vol. 1. See also articles by the same author in the
American Economic Review, Vol. 6 (1916), pp. 739-769, and Publications
of the American Statistical Association, Vol. 12 (1917), pp. 602-642.

growth would accordingly be a positive exponential curve,
similar in character to the curve representing the growth of a 
principal sum of money at compound interest. 
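The compound-interest analogy can be made concrete. In the sketch below the quantity grows by a constant ratio each period, which is what a geometric progression means; the function name, the starting value of 100, and the rate of 3 per cent are illustrative assumptions, not figures from Malthus.

```python
def geometric_growth(initial, rate, periods):
    """Return the values initial * (1 + rate)**t for t = 0..periods."""
    return [initial * (1 + rate) ** t for t in range(periods + 1)]


# Hypothetical example: 100 units growing at 3 per cent per period.
path = geometric_growth(100.0, 0.03, 3)

# In a geometric progression the ratio of successive terms is constant,
# while the absolute increase grows from period to period.
ratios = [later / earlier for earlier, later in zip(path, path[1:])]
increases = [later - earlier for earlier, later in zip(path, path[1:])]
```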

While some of the doctrines of Malthus regarding the controls 
to population growth are no longer accepted as tenable, the 
fundamental principle of the tendency of population to grow 
geometrically has not only been accepted with regard to popula- 
tion theory but has been widely applied in other fields. To 
people of the twentieth century this principle seems almost 
axiomatic, for they are familiar with the history of the nineteenth 
century, when the statistics show such a growth of population 
and such a development of many kinds of activities according to 
this principle of geometric progression. 

The principle of growth was not so obvious to those living at 
the time of Malthus, nor to those living in the middle of the 
nineteenth century. Consequently, it was startling and new to 
see the same principle applied to growth in certain economic and 
social phenomena, as was done by William Stanley Jevons, an 
English economist, in his celebrated book on The Coal Question 
(1865). Chapter IX of that book is entitled Of the Natural Law 
of Social Growth. In this he propounds the idea that many of 
the phenomena of economic and social life follow the same law 
of organic growth as population. In some, the progressive rate 
of geometric growth is greater than that of population; in some, 
less; but in all the growth is geometric. In another chapter of 
the same book, Jevons applied this principle and tested it with 
reference to England's progress in industry. His contribution 
was of the nature of the proposal of a hypothesis that served as a 
challenge to mathematically minded economists like himself 
and others and soon stimulated the development of ideas as to 
how best to write the equation for the curve that would repre- 
sent growth of population. By such an equation, it was thought, 
population could be forecast far into the future as well as for 
intercensus years. 

Population Curves. In 1891, A. S. Pritchett suggested that
an equation of the form $P = a + bt + ct^2 + dt^3$ would fit the
curve of population growth. The subject of equations for the
population curve became one of wide concern to population
students, economists, and scientists in general, as well as of
practical interest in obtaining accurate estimates of population
between the dates of taking the census. In 1907, Raymond
Pearl proposed that the form of this equation should be

$P = a + bt + ct^2 + d \log t$.*

The problem was again approached by G. Udny Yule, an English
statistician, in 1925;¹ and in later years the discussion was
continued.²

Perhaps the most striking contribution on the subject is that 
of Raymond Pearl and Lowell J. Reed, who in 1920 advanced the 
idea that the population curve should not continue to rise 
indefinitely but should level off after some period of time, and 
that thus the population curve showing the law of growth would
not follow the compound-interest curve indefinitely. Rather, it
would resemble the curve shown in Fig. 145 in Chap. XXI. 

The mathematical characteristics of this curve and its
equations are presented by these joint authors in the Journal of the
Royal Statistical Society for 1927.† As will be clear from a glance
at Fig. 145, the shape of this curve indicates that, under the law
of organic growth, a first period of relatively slow arithmetical
growth is followed by a period of very rapid arithmetical growth,
but that finally this rapid growth slows down, so that the curve
at the top assumes an asymptotic character.
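The three phases just described (slow growth, rapid growth, and a leveling off toward a limit) can be seen in the period-to-period increases of a logistic curve. The sketch below uses the common form limit / (1 + e^(−rate·(t − midpoint))); this form and every parameter value are assumptions for illustration and are not taken from the Pearl-Reed papers.

```python
import math


def logistic(t, limit, rate, midpoint):
    """Common logistic form: rises from near 0 toward `limit`."""
    return limit / (1 + math.exp(-rate * (t - midpoint)))


# Hypothetical parameters: upper limit 100, midpoint at t = 10.
values = [logistic(t, limit=100.0, rate=0.5, midpoint=10) for t in range(21)]

# Arithmetical increase from one period to the next: small at first,
# largest near the midpoint, and small again as the limit is approached.
increments = [later - earlier for earlier, later in zip(values, values[1:])]
```

Unlike the geometric curve, the values never exceed the limit, which is the asymptotic character referred to in the text.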

Early Population Theories. Quételet remarked that Malthus's
doctrine resolved itself essentially to the proposition that, under 
the most favorable industrial circumstances, population could 
grow no more rapidly than in an arithmetical progression, 
although, of course, he stated the geometric law of growth as a 

* Knibbs, George H., “The Laws of Growth of Population,” Journal of
the American Statistical Association, Vol. 21 (1926), p. 381.

¹ Journal of the Royal Statistical Society, Vol. 88 (1925), pp. 1-62, which
contains an excellent historical summary of the problem of curve fitting to
population growth.

² Reed, L. J., and Raymond Pearl, “On the Summation of the Logistic
Curve,” Journal of the Royal Statistical Society, Vol. 90 (1927), pp. 729-746.
The mathematics of the curve was discovered, say the authors, by Verhulst,
according to Quételet writing in 1838, and was again applied to population
by Pearl and Reed in 1920. Cf. Pearl, Raymond, Studies in Human
Biology (1924), Chap. XXIV, The Curve of Population Growth.

† Op. cit.

tendency. It could grow only in arithmetical progression
because it would be kept down to that rate by the fact that sub- 
sistence grows only in arithmetical progression. He also pointed 
out that the theory of population growth up to the time he was 
writing (1836) had not been developed to the point where it
could be considered “dans le domaine des sciences
mathématiques, auquel elle semble spécialement devoir appartenir”
[within the domain of the mathematical sciences, to which it seems
especially to belong].¹
Even so, Quételet himself never went to the point of developing a
mathematical equation expressing the law of population growth,
although in other ways his contributions as a population theorist
are outstanding. However, he did reach the point of suggesting
that the law of population growth is like that of a body that,
traveling through a resisting medium, tends to attain a limiting
velocity.

Yule suggested that this analogy probably inspired Verhulst,
professor of mathematics at the École Militaire, to a controversy
with Quételet on the subject. The problem of devising a
mathematical law of population growth was actively studied by
Verhulst for a number of years. He fitted logistic curves to the
population histories of several countries for as many years as
data were available, but the limited amount of data did not
inspire confidence in the results.² This work of Verhulst seems
to have been forgotten until the time of the Pearl-Reed studies
of 1920. Pearl and Reed's discovery of the law of population
growth in the mathematical form developed by them was
independent. As Yule says, they seem to have been unaware
of the formulation by Verhulst.

Basis for Rationalizing Trends. The attempt is made by 
students of the law of population growth to rationalize the 
fitting of such a logistic curve to experienced growth of popula- 
tion in many parts of the world, and at different times, by basing 
their reasoning upon the following points: 

¹ Sur l'homme (Bruxelles, Louis Hauman et Comp., 1836), pp. 283, 287.

² Notice sur la loi que la population suit dans son accroissement
(Correspondance mathématique et physique publiée par A. Quételet, 1838),
tome 10 (also numbered tome 2 of the third series), pp. 113-121; and by
the same author, “Recherches mathématiques sur la loi d'accroissement de
la population,” Nouveaux mémoires de l'Académie Royale des Sciences et
Belles-Lettres de Bruxelles, tome 17 (1845), pp. 1-38; “Deuxième mémoire
sur la loi d'accroissement de la population,” ibid., tome 20 (1847),
pp. 1-32. Citations from G. Udny Yule, op. cit., p. 57.

1. The construction of such a curve through the plotted points
showing actual population growth in a large number of places 
produces a good fit. 

2. Biological experiment under controlled conditions, with
other species than man, produces increases in numbers in a
manner following such a curve. Thus Pearl made such an
experiment with fruit flies under controlled conditions.¹

3. Studies of trends in birth rates and death rates, in their 
relation to population growth, appear to fit into the theory that 
the law of population growth follows this curve. 

4. Studies of death rates by age distribution of the population
and the relationship between age composition and total death
rate and birth rate of a population appear to fit into the law of
population thus formulated.²

5. While it is true that the parabolas of earlier writers fit
empirically the population growth wherever tried, such a curve
fit cannot be rationalized, because the extension of the parabola
goes on to infinity. On the other hand, the logistic curves of the
Verhulst, Pearl-Reed, or Gompertz variety approach a limit in an
asymptotic manner, which seems to be a more rational manner in
which to view the law of population growth.

6. The asymptotic limit that it is assumed population is 
approaching can be closely approximated by study of the circum- 
stances surrounding the determination of the factors influencing 
population growth. 

Thus, it is recognized in this theory of the law of population 
growth that should technological changes comparable with the 
industrial revolution occur, the asymptotic limit might have to 

¹ Cf. Pearl, R., The Biology of Death, pp. 253-254. Cited in Yule,
op. cit., p. 22.

² These ideas have reached the general public as well as the scientific
group, through such articles as Robert R. Kuczynski, “The World's Future
Population,” The New Republic, May 7, 1930; Aaron Hardy Ulm, “Our
Falling Birth Rate Is Studied by Experts,” The New York Times, Mar. 2,
1930; Louis I. Dublin (Statistician of the Metropolitan Life Insurance
Company), “America Approaching Stabilized Population,” The New York
Times, Mar. 4, 1930; and by the same author, “Our Aging Population: Its
Vital Effects,” The New York Times, Jan. 4, 1931. Cf. also Dublin,
Louis I., and Alfred J. Lotka, “On the True Rate of Natural Increase,”
Journal of the American Statistical Association, Vol. 20 (1925), pp. 305-339;
and Dublin, Louis I., “The Statistician and the Population Problem,”
ibid., Vol. 20 (1925), pp. 1-12.

be raised and that the law of population growth over a period of
centuries may be conceivably a series of ogive-like cycles. 

Criticism of Rationalized Trends. However, this rationalistic 
view of curve fitting to population and the attainment in this 
manner of a mathematical law of population growth have not 
gone unchallenged. Prof. A. L. Bowley, an outstanding English
statistician, says, “I regret that so much prominence has been
given to the logistic equation. It certainly has the merit, and 
the danger, of mathematical neatness, and it expresses what may 
be regarded as a fundamental law of population — that is, that 
population cannot increase indefinitely in constant geometric 
progression. There is, however, no reason a priori to suppose 
that the damping down of the increase is of so regular or uniform 
a nature that a mathematical function of the same form repre- 
sents it in all times and in all places, and none a priori to justify 
the use of a linear term (out of all possible functions) for this 
purpose. We should rather anticipate that the form of the 
function would be neither general nor linear. The justification 
for the logistic form is purely empirical, and, in fact, we are asked 
to accept it because it does give results which agree with the 
records of certain populations. Any other curve which gives as 
good an agreement has similar claims for representing the series 
of records. The closeness of the agreement is, I think, unduly 
accented by the very small vertical scales used by Dr. Pearl and
Mr. Yule. . . .”¹

T. H. C. Stevenson, another English statistician, rather 
prosaically declares that he finds sufficient explanation, without 
resort to logistic curves, for the rapid decline in birth rates since 
the end of the nineteenth century, in the dissemination of
knowledge of contraception.²

More recently, the whole question of the rationality of curve 
fitting was taken up in an admirably thorough manner by 
George H. Knibbs, whose findings are apparently that the 
mechanical process of the curve fitting is empirical and must be 
accepted as empirical but that the law of population growth 
may be conveniently expressed by such equations when it is 

¹ From remarks on Yule's paper, op. cit., p. 76.

² “The Laws Governing Population,” Journal of the Royal Statistical
Society, Vol. 88 (1925), pp. 63-76.

thoroughly understood how those equations apply, and also
their limitations.¹

It is natural to scientists to be skeptical, particularly of other 
scientists' startling discoveries, and the student of social science 
must get used to such controversies and pick and choose for 
himself what he believes to represent progressive development 
of human knowledge and what merely overzealous creative 
imagination. It is in these attempts to explain phenomena that 
the progressive development of human knowledge occurs. 

Application of Rational Trends to Social Philosophy. It was 
pointed out above that Jevons had advanced the hypothesis 
that the law of organic growth applies also to social and economic 
phenomena. Following the example of the population curve- 
fitting group, scientific curiosity turned to the discovery of a 
rational conception of curve fitting to social, biological, and 
economic phenomena in order to replace purely empirical 
methods. As Wesley C. Mitchell has pointed out,² “A step
toward such a conception is represented by the frequent
interpretation of certain trend lines as showing the ‘growth factor.’
Statisticians dwell with satisfaction upon their demonstrations 
that certain industries have expanded decade after decade at a 
substantially uniform rate, or at a rate which has changed in 
some uniform way. They take almost as much pleasure in con- 
templating the somewhat similar rates at which different indus- 
tries have grown in given periods and countries. Nor are they 
at a loss for explanation of these uniformities. In view of the 
increase in population characteristic of the great commercial 
nations and of the advance in industrial technique, it seems 
scarcely fanciful to think of modern society as ‘tending’ to
produce an ever larger supply of goods for the satisfaction of its 
wants. On this basis, cyclical fluctuations appear as alternating 
accelerations and retardations in the pace of a more fundamental 

¹ “Laws of Growth of Population,” Journal of the American Statistical
Association, Vol. 21 (1926), pp. 381-398; and Vol. 22 (1927), pp. 49-59.

² Business Cycles: The Problem Stated and Its Setting (1928), pp. 221-224; cf.
Prescott, Raymond B., “Law of Growth in Forecasting Demand,” (1928),
Journal of the American Statistical Association, Vol. 17 (1922), pp. 471-479.
Later, Leroy E. Peabody fitted such a curve to railway traffic in the United
States, “Growth Curves and Railway Traffic,” Journal of the American
Statistical Association, Vol. 19 (1924), pp. 476-483.

process. Secular trends, in short, are taken to measure economic
progress generation by generation. 

“A bold speculation of this sort has been ventured by Raymond
B. Prescott. He suggests that perhaps ‘all industries, whose
growth depends directly or indirectly upon the ability of the
people to consume their products,’ pass through similar phases
in the course of their development. Four stages seem to be 
common. 

1. Period of experimentation. 

2. Period of growth into the social fabric. 

3. Through the point where the growth increases, but at a 
diminishing rate. 

4. Period of stability. 

“On this basis, Prescott suggests that the secular trends of all 
such industries may be represented by a single type of curve — 
that yielded by the Gompertz equation. Every country may 
have a different rate of growth, and so may every industry, 
because no two industries have the same combination of
influences. They will trace the same type of curve, however,
even though the rate of growth is different.”

More recently, an ambitious and carefully studied attempt to 
rationalize the whole subject of trends in economic phenomena 
was made by Simon S. Kuznets, of the National Bureau of 
Economic Research.¹ Kuznets analyzes the various factors
making for growth, and also making for slowing up of growth, 
under the following items: 

1. On the side of growth: 

Population increase. 

Changes in demand.

Technological changes. 

2. On the slowing up of growth: 

Slackening of technological progress. 

Retarding influence of other slower industries. 

Funds available for expansion decrease in relative size as industry 
grows. 

Competition of later developing industries in other countries. 

¹ Secular Movements in Production and Prices (1930); in 1943, Kuznets's
work is still the best statistical study of this type. For more recent trend
studies of a different type, see Edwin Frickey, Economic Fluctuations in the
United States, 1860-1914 (1942), and Norman J. Silberling, The Dynamics
of Business (1942). These studies use subjective methods for analyzing
trends and cycles.

Kuznets fits logistic curves to a large number of production
series and also fits appropriate curves to the corresponding price 
series. It should be noted that this type of rationalization does 
not apply to price series, and as a rule the curves that Kuznets 
fits to his price series were merely parabolas and represented 
empirical trends. One of the most interesting results of his 
work is his discovery and analysis of “secondary trends.” 



Fig. 139. — Production of Portland cement in the United States with logistic 
trend line, 1880-1924. 

Thus, from a large variety of data, he took out the long-time
growth, upon the assumption of the existence of a logistic growth
element, and he found, not only cycles, but also longer wavelike
movements of 11 to 20 years. This is illustrated by Figs. 139 to
141, reproduced from his book and showing the type of analysis
as applied to cement production and prices, 1880-1924.¹ As
seen in Fig. 139, the heavy line represents the logistic curve, and
there are long sweeps of the actual data in waves above and below
this growth curve, as well as cyclical movements of shorter
duration. Figure 140 shows a parabola fitted to the course of
prices of cement during the same period. Figure 141 shows the
long wavelike movements in production and in prices, with the
relative fluctuations of the actual data above and below these
secondary trends. Kuznets calls the logistic growth curve the
“primary trend line,” and the heavy, black, wavelike line in
Fig. 141 represents the secondary trends of the production of
cement. The actual data fluctuate above and below these
secondary trends in major and minor cycles.

Fig. 140. Factory prices of Portland cement in the United States, original data
and primary trend, 1880-1924.

Fig. 141. Production of and prices of Portland cement in the United States,
1880-1924. Secondary trends and minor cycles.

¹ Op. cit., pp. 100-101, reproduced by permission of the author.

Before the publication of Kuznets's work these longer
movements had been studied by C. A. R. Wardell.¹ Wardell called
the movements “major cycles” rather than secondary trends,
and his method of analysis was quite different from that followed
by Kuznets. He also attempted to give an explanation of the
major business cycle that Kuznets rejects.² In 1927, also, there
appeared in Russian a discussion of the whole problem of major 
cycles, which contains a report by Kondratieff and a counter- 
reply by D. T. Oparin. To explain these major cycles Kon- 
dratieff developed the theory that they are essentially cycles 
of expansion and contraction in the growth of the basic capital 
equipment of a country. 

Thus, starting with the desire to define the law of population 
growth more precisely and to bring the population problem into 
the realm of mathematical treatment, scholars have carried on 
by analogy into other fields; so far as economics is concerned, 
the principal result so far is the discovery of these long wavelike 
movements. Not only do the theoretical economists need to 
explain the old-fashioned business cycle (which was always a 
rather vague concept), but they now are challenged to explain 
(1) secondary secular movements or major business cycles, (2)
ordinary business cycles, and (3) minor business cycles. The
analysis of time series, then, must include some types of
fluctuation in addition to those described in a preliminary manner
at the beginning of the chapter.

The following classification of movements is now suggested.³

1. Trend, or long-time growth, which appears to be logistic in 
character and for which a mathematical formula may be rational. 

2. Cyclical movements of three types, for which a rational 
mathematical formula is not appropriate. 

a. Secondary secular movements or major cycles. 

b. Cycles (the old theoretical business cycle). 

c. Minor cycles. 

¹ An Investigation of Economic Data for Major Cycles, Thesis (University 
of Pennsylvania, 1927). 

² Op. cit., pp. 265-266. 

³ Cf. classification suggested by Prof. Willford I. King, which is similar, 
in “Principles Underlying the Isolation of Cycles and Trends,” Journal of 
the American Statistical Association, Vol. 19 (1924), p. 468. 





3. Seasonal variations, for which a mathematical formula is not 
rational. 

4. Irregular fluctuations, such as those due to wars, epidemics, 
floods, or strikes. These are called “residual fluctuations” and 
may follow the normal curve.¹ 

Empirical Trends. Trend analysis, that is to say, the applica- 
tion of mathematical processes in order to obtain equations 
describing direction of movement of a time series, may be 
applied, not only for rational ends indicated in the discussion of 
the law of organic growth, but also empirically where no a priori 
knowledge about the character of growth or law of movement or 
trend exists. Indeed, the search for such a law may have no 
bearing on the analysis; the trend may be sought for the purpose 
of isolating and studying cyclical movements. When trends are 
found without seeking to verify some hypothesis concerning a 
law of growth but merely with respect to given data, they are 
empirical trends. 

Application of Empirical Trends to Cycle Analysis. The 
third factor mentioned at the beginning of this chapter as a 
force stimulating statistical analysis of time series has been the 
abstract study of the business cycle. Such abstract analysis 
has challenged the mathematical economist and the statistician 
to discover and to apply methods of statistical analysis that 
would measure the cycle. 

Economic history of the modern era has been one of alter- 
nating periods of relative prosperity and relative depression 
and has also been characterized by periods of more or less violent 
speculative activity. The Mississippi Scheme and the South 
Sea Bubble burst in France and England in 1720, and there 
occurred commercial crises of major importance in 1763, 1772, 
1783, and 1793. During the eighteenth century these recurring 
periods of crisis excited much discussion, but eighteenth-century 
writings dealt mainly with the dramatic surface events and did 
not develop a theory explaining them. And indeed the funda- 
mental principles of economics were not formulated until the 
latter half of the eighteenth century. The publication of Adam 
Smith's Wealth of Nations in 1776 is usually taken as the debut 
of economics as a science. 


¹ See pp. 285-297, 570, and 648. 

While a group of economists following Adam Smith developed 
a theoretical explanation of the operation of economic forces 
under normal conditions, or in the long run, another group that 
assumed the role of critics of the “economists” developed 
theories of the business cycle. These were such men as Sismondi, 
Rodbertus, and Karl Marx. J. C. L. Simonde de Sismondi, an 
Italian Swiss, had originally been a thorough convert of Adam 
Smith and laissez faire and had become the Continental expositor 
of his theories; but as he said, writing in 1818 and referring to 
the depression of 1815-1817, he was deeply affected by the com- 
mercial crisis that Europe had experienced and by the cruel 
sufferings of the industrial workers that he had witnessed in 
Italy, Switzerland, and France and that all reports showed to 
have been at least as severe in England, in Germany, and in 
Belgium.¹ He set about developing a theory to explain the 
recurrence of such periods, and in his work are found many of 
the ideas current even today concerning the origin and explana- 
tion of the business cycle. He suggested that the business cycle 
is due to the faulty organization of the capitalist system and that 
the system is planless and therefore needs planning. He also 
suggested the explanation that what is needed is a better dis- 
tribution of income. He suggested the oversaving hypothesis. 
His principal explanation is the inequality in the distribution of 
incomes resulting in glutting of the markets and the production 
of crises and depressions. 

The idea that commercial crises are cyclical in character 
evolved early in the nineteenth century; some even went so far 
as to advance the theory that they occur every 7 or every 11 years. 
In 1875, this led the economist and statistician, W. S. Jevons, to 
propound a theory that the business cycle is due to cycles that 
occur in sun spots, which it had been discovered have a rhythm 
of about 11 years.² During the latter half of the nineteenth cen- 
tury a number of statistical attempts to discover the business 
cycle were made. The attempts used the idea of smoothing 
out the irregular fluctuations in the curves of raw data and 

¹ Mitchell, op. cit., pp. 4-5. The historical material here given on the 
business cycle is taken principally from this source. 

² For a more complete discussion of the history of business-cycle theory 
than it is possible to give here, see ibid. and also Ernst Wagemann, Economic 
Rhythm (1930), either of which contains further bibliographical references. 

thereby clarifying the cyclical movements. The earliest 
examples of such statistical work appear to be in 1884.¹ 

Both Jevons and the later experimenters of the nineteenth 
century were content with attempts to discover cyclical move- 
ments in separate individual series. In 1909, Beveridge in 
England; in 1911, Julin in France; and in 1913, Mortara in 
Italy conceived the idea of combining a number of series into a 
composite statistical measure of the business cycle. The work 
of carrying out this task was then largely taken over by the 
American statisticians, in the construction of the so-called 
“barometers” of business conditions that have been described 
in Chap. XIX, Index Numbers. The period up to about 1914 
may be characterized as one during which interest in the subject 
of the business cycle was intense. Economists were in sharp 
controversy with the business-cycle theorists — denying emphat- 
ically the implications that they drew from their analysis of the 
statistics available and from their theoretical explanations of the 
business cycle. At the same time, the disturbing theories of 
the business-cycle students had greater claim to general interest 
because they touched upon a more vital and present thing than 
was customarily dealt with by the conventional economist. 
The conventional economist was explaining how things tend to 
happen under normal conditions, and the business-cycle theorists 
loudly proclaimed that we never live under “normal conditions” 
and that the theories of the economist were therefore useless. 
At the same time, the interest of the practical businessman was 
aroused by the desire for knowledge of the relationship between 
his own particular business and the general business cycle. 

Development of Technique for Time-series Analysis. The 
pressure to develop a statistical technique to analyze the prob- 
lem was thus very great, and the accumulation of available 
statistical material to analyze had been rapid for a number of 
years. The technique that developed assumed two general 
characteristics, one of which has since been extensively used, 
the other less frequently. 

The first method of technique that developed was the discovery 
statistically of the cycle in time series by the removal of the 

¹ Poynting, J. H., and R. H. Hooker, “A Comparison of the Fluctua- 
tions in the Price of Wheat and in the Cotton and Silk Imports into Great 
Britain,” Journal of the Royal Statistical Society, Vol. 47 (1884), pp. 34-64. 

trend from a series of annual data. Trends were fitted empir- 
ically to the data by the method of least squares or some other 
method — most commonly by the method of least squares — using 
relatively short periods of time, say 9 to 19 years. The cyclical 
movements then were the measures of the movements of the 
data above and below the empirical trend. Prof. Willford I. 
King said, “Any particular type of fluctuation in which we 
happen to be interested can be successfully studied only when 
most of the other kinds of fluctuations have been eliminated.”¹ 

This is, of course, the raison d'être for the empirical trend 
analysis, which is primarily for the purpose of isolating the ordi- 
nary and the minor cycles. The major cycles or secondary 
secular movements are best studied by the Kuznets methods that 
have been described and illustrated. The methods of analysis 
used are essentially similar to those employed in empirical trend 
analysis, but the Kuznets logistic trend lines may be rationalized 
in terms of a law of organic growth. 

The second method of technique that developed was the 
attempt to apply harmonic analysis or the periodogram to series 
of economic data, a different application of the method of least 
squares. This was the work of Henry L. Moore of Columbia 
University in his application of Fourier's theorem, the mathe- 
matics of which Fourier had developed a century ago in his 
Théorie des mouvements de la chaleur dans les corps solides and 
for which he was feted by the Académie des Sciences in 1812. 

Prof. Moore applied the mathematics of the periodogram to the 
records of rainfall in the corn belt of the United States, working 
out the periodogram equations for the cycles of rainfall; he 
discovered similar cycles in crops and introduced the harmonic 
analysis into modern statistical method. He says:² “The prin- 
cipal contribution of this essay is the discovery of the law and 
cause of economic cycles. The rhythm in the activity of eco- 
nomic life, the alternation of buoyant, purposeful expansion 
with aimless depression, is caused by the rhythm in the yield 
per acre of the crops; while the rhythm in the production of the 

¹ Journal of the American Statistical Association, Vol. 19 (1924), p. 468. 

² Economic Cycles: Their Law and Cause (1914). Cf. Crum, W. L., 
“Periodogram Analysis,” Chap. XI in H. L. Rietz, Handbook of Mathe- 
matical Statistics (1924). Also Brunt, David, The Combination of Observa- 
tions (1931), Chaps. XI and XII. 

crops is, in turn, caused by the rhythm of changing weather, 
which is represented by the cyclical changes in the amount of 
rainfall. The law of the cycles of rainfall is the law of the cycles 
of the crops and the law of economic cycles.” 

The mathematics of the harmonic analysis are somewhat com- 
plex, and this method has not attained the popularity that 
has been attached to the removal of empirical trend by using 
straight lines or second- or third-degree polynomials, where the 
mathematical analysis involved is quite simple. 

Use of Functions of Arc Tangent and Orthogonal Polynomials 
in Trend Analysis. In recent years two modified forms of the 
conventional trend analysis by the method of least squares have 
been developed. In 1928, it was suggested that the inverse 
trigonometric function, or arc tangent, could be adapted to 
measuring trends in series behaving in the following manner:¹ 

1. A downward tendency approximating a straight line but 
of such nature that projection of a straight line into the future 
would lead to absurd results, that is, negative or ridiculously 
small positive values when comparatively large positive values 
only are possible. 

2. Approximately a linear growth or decline, followed by an 
abrupt change in level (rise or drop) and subsequent resumption 
of the early tendency. 

3. Approximately a straight-line trend interrupted by a sharp 
rise or drop, followed by another abrupt change in level and 
subsequent resumption of the early movement at the same or a 
different level. 

The method was used successfully in fitting a trend to the 
annual prices of International Paper common stock for the period 
1900-1926 and to the annual index of wholesale prices in the 
United States, 1900-1928. 

The orthogonal analysis is a method invented for reducing the 
amount of arithmetical calculation involved in fitting poly- 
nomials to time series by the method of least squares, especially 
second- and third-degree polynomials or polynomials of higher 
degree. The fitting of a polynomial of higher than second degree 
to a time series involves laborious calculations, particularly if a 
considerable period of time is covered. This laborsaving method 

¹ Carmichael, F. L., “The Arc Tangent in Trend Determination,” 
Journal of the American Statistical Association, Vol. 23 (1928), pp. 253-262. 

is described in detail, together with tables of values to facilitate 
its use, in Chap. XXII.¹ 

¹ See pp. 600-615. Also cf. Jordan, Charles, “Approximation and 
Graduation According to the Principle of Least Squares by Orthogonal 
Polynomials,” The Annals of Mathematical Statistics, Vol. 3 (1932), pp. 257- 
357. Cf. Romanovsky, V., “Note on Orthogonalizing Series of Functions 
and Interpolation,” Biometrika, Vol. 19 (1927), pp. 93-99; Jordan, Charles, 
“Sur une série de polynômes dont chaque somme partielle représente la 
meilleure approximation d'un degré donné suivant la méthode des moindres 
carrés,” Proceedings of the London Mathematical Society, Vol. 20 (1921), pp. 
297-325; and Dieulefait, Carlos E., “La determinación de la tendencia 
secular en las series económicas,” Gabinete de Estadística, Rosario, Argentine 
Republic (Santa Fe), Universidad Nacional del Litoral (1932), pp. 1-52. Cf. 
Fisher, R. A., Statistical Methods for Research Workers (4th ed., 1932), 
pp. 133-142. 

TREND ANALYSIS 

Empirical Trend vs. Rational Trend. Both empirical and 
rational trends are obtained by analysis from raw data; the 
difference between the two is that a rational trend can be 
explained in terms of long-time growth or decline, whereas an 
empirical trend has no meaning per se. The empirical trend is 
a useful tool of analysis, as will be seen in the ensuing discussion. 

In the preceding chapter the attempt was made to convey the 
idea that a rational trend is one that is found for its own sake; 
it has a rational explanation and is useful as a method of inter- 
pretation in itself. While it may be true that the rationaliza- 
tion that is made with respect to such trends is preliminary or 
even tentative, nevertheless the original intent is to make a 
rational use of them. Empirical trends are those for which there 
is admittedly no rational basis at the start, being used merely as 
a convenient method of removing from the data longer time 
movements that obscure the shorter time cyclical fluctuations. 

Empirical trends in themselves may have no rational sig- 
nificance as a description of any type of long-time growth, or 

at a point in time coincident with the peak of a secondary secular 
movement would presumably be in the form of a parabola. At 
another point in time, a 9-year trend analysis may give a straight 
line, or a logarithmic line. If a trend line happened to be cal- 
culated for a period of time from the low point of a secondary 
secular movement to the high point of another, the empirical 
trend might assume the form of a Verhulst growth curve; but 
it may have no such significance as a law of growth in that case, 
being simply the result of happening to take an empirical trend 
for that period of time. An examination of the heavy black 
curve representing the secondary trends in cement production 
(Fig. 141, page 566) will help to make clear what is meant by these 
statements. 

Detecting Cycle by Removing Empirical Trend. While empirical 
trends may have no rational significance per se, the fitting of 
an empirical trend to the annual data of a time series will make 
it possible to isolate the residuals from the trend. These 
residuals constitute the cycles and minor cycles of the period 
analyzed. The first clear statement of the analysis of time 
series by this method was made by W. M. Persons in 1915.¹ 
The method is illustrated by examples at the end of this chapter. 

Thus the function of empirical trend analysis is to obtain an 
approximation to some longer term movement for the purpose of 
eliminating this in order to study shorter term movements of a 
cyclical or accidental character. The empirical trend may 
approximate a segment of a long-term cyclical movement, or it 
may approximate a portion of long-term growth in the series 
that might itself have rational explanation. What the empirical 
trend measures depends upon the circumstances in each problem, 
and the discovery of the rational nature of an empirical trend 
depends upon a priori knowledge. 

Methods of Fitting Trend. Three methods of fitting trend to 
time series can be distinguished: (1) the method of least squares, 
(2) the method of selected points, and (3) the method of 
averages. 

Fitting a Trend Line by the Method of Least Squares. Figure 142 
represents a plane in which there are seven points, $P_1, \dots, P_7$. 
To simplify the arithmetic an uneven number of points is taken, 
and the middle point is selected for the location of the y-axis.² 
Accordingly, t varies from -3 to +3. The coordinates of the 
points, as may be observed from the figure, are as follows: 

$$P_1(t = -3,\ y_1) \quad P_2(t = -2,\ y_2) \quad P_3(t = -1,\ y_3)$$

$$P_4(t = 0,\ y_4) \quad P_5(t = 1,\ y_5) \quad P_6(t = 2,\ y_6) \quad P_7(t = 3,\ y_7)$$

¹ American Economic Review, December, 1916, pp. 739-769; Publications 
of the American Statistical Association, June, 1917, pp. 602-642; Harvard 
Review of Economic Statistics, Preliminary Vol. 1 (1919). Cf. Mitchell, 
W. C., Business Cycles: The Problem Stated and Its Setting, pp. 200, 212-213, 
328-330. 

² For statistical purposes it is more convenient to take a more recent year 
as the time origin than that of the birth of Christ. Thus, if a given set of 
data run from 1927, say, to 1937, it might be convenient to choose 1932 as 
the zero year. If this were done, then 1933 would be t = 1, 1935 would 
be t = 3, 1929 would be t = -3, etc. 

The corresponding points on the straight line to be found, for 
example, point A in the figure, may be represented by the 
following coordinates: 

$$(t = -3,\ y'_1) \quad (t = -2,\ y'_2) \quad (t = -1,\ y'_3)$$

$$(t = 0,\ y'_4) \quad (t = 1,\ y'_5) \quad (t = 2,\ y'_6) \quad (t = 3,\ y'_7)$$

The general form of the equation for a straight line in a field of 
coordinates y and t is y = a + bt, and for this line the equation 
is as follows: 

y' = a + bt (1) 

The line is determined for the particular case by finding values 
of a and b. 



The line that is sought is the one from which the sum of the 
squared deviations of the points from the line is less than such a 
sum with respect to any other line. This is the least-squares 
criterion. 

The vertical residuals of particular points from the line are 
as follows, as illustrated in Fig. 142 for $P_6$: 

$$r_1 = y_1 - y'_1$$
$$r_2 = y_2 - y'_2$$
$$\cdots$$
$$r_6 = y_6 - y'_6 \quad \text{(illustrated by } P_6 \text{ in Fig. 142)}$$
$$r_7 = y_7 - y'_7$$

Some of these variations (designated as r) are negative, for 
example, at $P_7$, while others are positive, as at $P_6$. When 
squared, however, they are all positive, and the condition that 
must be satisfied according to the least-squares criterion for a 
line that will best fit these points is that $\sum r^2$ = minimum, in 
other words, that 

$$\sum (y - y')^2 = \text{minimum} \tag{2}$$

The value of y', from Eq. (1), may be substituted; Eq. (2) then 
becomes 

$$\sum (y - a - bt)^2 = \text{minimum} \tag{3}$$


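The condition under which Eq. (3) is true is that the total 
differential is equal to zero, in other words, that 

$$d\left(\sum r^2\right) = \frac{\partial \sum r^2}{\partial a}\,da + \frac{\partial \sum r^2}{\partial b}\,db = 0 \tag{4}$$

Inasmuch as da and db cannot be equal to zero, this gives the two 
conditions that¹ 

$$\frac{\partial}{\partial a} \sum (y - a - bt)^2 = -2 \sum (y - a - bt) = 0$$

$$\frac{\partial}{\partial b} \sum (y - a - bt)^2 = -2 \sum t(y - a - bt) = 0$$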

and hence the following two equations, by canceling out the 2's 
and carrying out the summations: 

$$\sum y = Na + b \sum t \tag{i}$$

$$\sum ty = a \sum t + b \sum t^2 \tag{ii}$$

In these two equations, all the terms are known except a and 
b, because $\sum t = 0$ and $\sum y$ is the sum of the known y's of the 
seven points $P_1, \dots, P_7$. The $\sum t^2$ is 

$$9 + 4 + 1 + 0 + 1 + 4 + 9 = 28$$

Because $\sum t = 0$, values for a and b can be found as follows: 

$$a = \frac{\sum y}{N} \quad \text{from Eq. (i)}$$

$$b = \frac{\sum ty}{\sum t^2} \quad \text{from Eq. (ii)}$$

¹ In the case under consideration, it is not necessary to be concerned with 
the possibility that these same conditions might also hold true for a maxi- 
mum or a minimum, since the conditions of the problem indicate that it 
is a minimum. 

Accordingly, the equation for the line of best fit, by the 
criterion of least squares, is as follows: 

$$y' = \frac{\sum y}{N} + \frac{\sum ty}{\sum t^2}\, t \tag{5}$$


Numerical Illustration. As a more concrete illustration, 
values will be assigned to the y's of the seven points, as follows 
(t coordinates remaining as before): 

$$P_1(y = 5) \quad P_2(y = 2) \quad P_3(y = 7) \quad P_4(y = 4)$$

$$P_5(y = 6) \quad P_6(y = 10) \quad P_7(y = 8)$$

An orderly work sheet will be set up in order to find $\sum y$, $\sum ty$, 
and $\sum t^2$. N, of course, is equal to 7. 

Work Sheet for Finding Best-fitting Straight Line for Seven 
Given Points 

     t       y      ty     $t^2$
    -3       5     -15       9
    -2       2      -4       4
    -1       7      -7       1
     0       4       0       0
     1       6       6       1
     2      10      20       4
     3       8      24       9

$\sum t = 0$, $\sum y = 42$, $\sum ty = 50 - 26 = 24$, $\sum t^2 = 28$ 


The equation for the best-fitting line according to the least- 
squares criterion is therefore as follows [see Eq. (5)]: 

$$y' = \frac{42}{7} + \frac{24}{28}\, t$$

or 

$$y' = 6 + 0.86t$$

It will be well to note what the equation says. First, with 
each unit increase of t, the line (that is, the value of y') rises by 
0.86. This value, 0.86, is called the “slope” of the line; and it 
is the tangent of the angle that the line makes with the t-axis 
or with any line parallel to the t-axis. Second, it says that, 
when t = 0, y' = 6. This means that the line passes through 
the y-axis at a point +6 from the t-axis (when the y-axis is 
located at the middle point in time). 
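
The arithmetic of the work sheet can be checked with a short Python sketch. It assumes, as in the text, an odd number of equally spaced observations with the time origin at the middle one, so that the sum of t is zero; the function name `fit_centered_line` is illustrative and does not come from the text.

```python
# Least-squares straight line y' = a + b*t with the time origin at the
# middle observation, so that sum(t) = 0 and the normal equations decouple.

def fit_centered_line(y):
    """Return (a, b) for an odd-length series with t = -k, ..., 0, ..., +k."""
    n = len(y)
    if n % 2 == 0:
        raise ValueError("this sketch assumes an odd number of observations")
    half = n // 2
    t = list(range(-half, half + 1))
    sum_y = sum(y)
    sum_ty = sum(ti * yi for ti, yi in zip(t, y))
    sum_t2 = sum(ti * ti for ti in t)
    a = sum_y / n        # from Eq. (i), because sum(t) = 0
    b = sum_ty / sum_t2  # from Eq. (ii), because sum(t) = 0
    return a, b


y_values = [5, 2, 7, 4, 6, 10, 8]  # P1 ... P7 from the work sheet
a, b = fit_centered_line(y_values)
print(f"y' = {a:.2f} + {b:.2f} t")
```

With the seven values of the work sheet this gives a = 42/7 = 6 and b = 24/28, which rounds to 0.86.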

If the y-axis were shifted from its present location to the 
position t = -3, everything else remaining in its original 
position, the value of the t coordinates of all the points P will 
change to accord with the new location of the y-axis. Also, it is 
to be noted that the above equation would then become 

$$y' = [6 - 3(0.86)] + 0.86t$$

or 

$$y' = 3.42 + 0.86t$$

since 3.42 will be the intercept on the new y-axis. 
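
The change of origin can be sketched in Python as follows. The helper `shift_origin` is a hypothetical name introduced only for illustration; it uses the rounded slope 0.86 from the text, so the intercept agrees with the 3.42 given above.

```python
def shift_origin(a, b, shift):
    """Re-express y' = a + b*t in terms of t_new = t - shift.

    Moving the y-axis to the old position t = shift leaves the slope
    unchanged and replaces the intercept by a + b*shift.
    """
    return a + b * shift, b


a_old, b_old = 6, 0.86                           # rounded values used in the text
a_new, b_new = shift_origin(a_old, b_old, -3)    # y-axis moved to old t = -3

# The line itself has not moved: the old t = 2 is the new t = 5.
old_value = a_old + b_old * 2
new_value = a_new + b_new * 5
print(round(a_new, 2), b_new)
```

Only the intercept changes; every point on the line keeps the same height, which the comparison of `old_value` and `new_value` illustrates.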



Fitting Second- or Third-degree Curves. Second-, third-, or 
even higher degree curves may similarly be fitted by the method 
of least squares. It may happen that the points are distributed 
in such a manner that a straight line does not fit. For example, 
Fig. 143 shows seven points that would be better fitted by a 
parabola. The general form of the equation for such a curve is 

$$y' = a + bt + ct^2$$

The equations for finding values of a, b, and c, for such a best- 
fitting parabola, are worked out on precisely the same principles 
as those for finding a and b for the best-fitting straight line.¹ 
That is to say, the equation $y' = a + bt + ct^2$ is fitted to the 
points so that 

$$\sum (y - y')^2 = \text{minimum} \tag{6}$$

and when the value of y' is substituted in this equation, it 
becomes 

$$\sum (y - a - bt - ct^2)^2 = \text{minimum} \tag{7}$$

¹ For a better method of fitting polynomials by the method of least 
squares, see Chap. XXII, Orthogonal Polynomial Trends. 

When this expression is differentiated with respect to a, b, 
and c, following the same method as in Eqs. (4), (i), and (ii), 
the equations for finding a, b, and c are obtained, as follows: 

$$\sum y = Na + b \sum t + c \sum t^2 \tag{i}$$

$$\sum ty = a \sum t + b \sum t^2 + c \sum t^3 \tag{ii}$$

$$\sum t^2 y = a \sum t^2 + b \sum t^3 + c \sum t^4 \tag{iii}$$

A work sheet such as the following form (leaving out columns for 
the uneven powers of t; they will presumably all be zero since 
the zero value of t is selected in the middle of an odd number of 
years) is used for finding values of a, b, and c. 


Work Sheet for Finding Best-fitting Parabola for Seven Given 
Points 

     t       y      ty     $t^2 y$     $t^2$     $t^4$
    ...     ...     ...      ...        ...       ...

$\sum t = \dots$, $\sum y = \dots$, $\sum ty = \dots$, $\sum t^2 y = \dots$, $\sum t^2 = \dots$, $\sum t^4 = \dots$ 

Since $\sum t = \sum t^3 = 0$, when the sums of the columns in the work 
sheet are substituted in Eqs. (i), (ii), and (iii) above, the three 
unknowns a, b, and c may be found by solutions of these. 
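
A minimal Python sketch of this solution follows. It assumes an odd number of equally spaced observations with the origin at the middle one, so that the sums of the odd powers of t vanish. Applying it to the seven y values of the straight-line illustration is an assumption made here for demonstration only; the text leaves the parabola work sheet blank. The function name is illustrative.

```python
def fit_centered_parabola(y):
    """Return (a, b, c) of y' = a + b*t + c*t**2 for t = -k, ..., +k."""
    n = len(y)
    if n % 2 == 0 or n < 3:
        raise ValueError("this sketch assumes an odd number (>= 3) of observations")
    half = n // 2
    t = list(range(-half, half + 1))
    s_y = sum(y)
    s_ty = sum(ti * yi for ti, yi in zip(t, y))
    s_t2y = sum(ti * ti * yi for ti, yi in zip(t, y))
    s_t2 = sum(ti ** 2 for ti in t)
    s_t4 = sum(ti ** 4 for ti in t)
    # Eq. (ii) with sum(t) = sum(t**3) = 0 contains b alone.
    b = s_ty / s_t2
    # Eqs. (i) and (iii) reduce to two equations in a and c:
    #   s_y   = n*a    + c*s_t2
    #   s_t2y = a*s_t2 + c*s_t4
    det = n * s_t4 - s_t2 * s_t2
    a = (s_y * s_t4 - s_t2y * s_t2) / det
    c = (n * s_t2y - s_t2 * s_y) / det
    return a, b, c


a, b, c = fit_centered_parabola([5, 2, 7, 4, 6, 10, 8])
print(round(a, 3), round(b, 3), round(c, 3))
```

Because the odd-power sums are zero, b is the same as for the straight line, and only a and c have to be found from a pair of simultaneous equations.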

Probability Theory Is Not Applied. It must be remembered that 
the application of the least-squares criterion for obtaining the 
line that best fits a time series does not involve the application 
of the theory of least squares in the sense that the trend line 
obtained is a most probable line, expressive of a law of move- 
ment or growth in the probability sense.¹ As originally applied, 
the theory of least squares had a definite connection with the 
theory of probabilities because it was devised as a method of 
obtaining a measure of the most probable orbit of a comet, etc. 
In the fitting of a trend line to a single time series there is no 
multiplicity of cases fluctuating in a normal distribution about the 

¹ Cf. Kuznets, Simon S., Secular Movements in Production and Prices 
(1930), p. 62, who cites W. H. R. Lexis, Zur Theorie der Massenerscheinungen in 
der menschlichen Gesellschaft (Freiburg i. B., F. Wagner, Ed.; 1877), pp. 31-33. 
See also Tintner, Gerhard, “The Analysis of Economic Time Series,” 
Journal of the American Statistical Association, Vol. 35 (1940), pp. 93-100. 

trend line. The use of the least-squares criterion in trend fitting 
for time series is merely the application by analogy of a method 
that produces desired results; it gives an objective criterion for 
finding the line of best fit. If the analyst can be satisfied with 
a less objective method, he may use, for example, the method of 
selected points, which will now be described. 

Method of Selected Points. One of the simplest methods 
of determining the trend of a time series is to make the trend 
“line” pass through certain points selected as representative of 
normal values. This line¹ may be drawn in a purely freehand 
fashion, or a mathematical equation may be determined such 
that it is satisfied by the coordinates of the selected points. 

To determine a unique mathematical equation in a given 
case the number of selected points must be taken equal to the 
number of parameters in the equation. Thus, if a straight-line 
trend seems appropriate, two normal years are selected (pref- 
erably near the ends of the series) and the values of a and b in 
the equation y' = a + bt are so determined that the equation 
is satisfied by the values of t and y for the selected points. If 
a parabolic trend of the type $y' = a + bt + ct^2$ is deemed 
appropriate, then three normal points must be selected to 
determine the values of a, b, and c. In general, if a polynomial 
of the nth degree is taken to portray the course of the trend, 
viz., $y' = a + bt + ct^2 + \cdots + kt^n$, then there must be n + 1 
selected points, one for each parameter. The polynomial is the 
simplest type of mathematical equation to employ for this purpose. Other, more 
“rational” types may also be fitted by this method, however, 
and its use in fitting a simple logistic curve is described below. 

The actual process of finding the mathematical equation of 
the chosen type that is satisfied by the selected points consists 
in solving as many simultaneous equations as there are 
selected points (or parameters to be determined). 
Thus if $(t_1, y_1)$ and $(t_2, y_2)$ are the coordinates of the selected 
points, the straight line y' = a + bt passing through these 
points is given by the solution of the following equations for 
a and b: 

$$y_1 = a + bt_1$$
$$y_2 = a + bt_2$$

¹ “Line” is here used in the generic sense; it may be either straight or 
curved. 

For example, if the time scale is such that $t_1 = 3$ and $t_2 = 9$ 
and if the y values for these years (or months) are¹ $y_1 = 68$ and 
$y_2 = 110$, then a and b are found by solving the equations 

$$68 = a + 3b$$
$$110 = a + 9b$$

These yield a = 47 and b = 7; hence the equation for the given 
trend is y' = 47 + 7t. 
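
The two-point solution can be written as a small Python sketch. The function name `line_through` is illustrative; the selected points are those of the example above.

```python
def line_through(p1, p2):
    """Return (a, b) of the line y' = a + b*t through two selected points."""
    (t1, y1), (t2, y2) = p1, p2
    if t1 == t2:
        raise ValueError("the two selected points must differ in t")
    b = (y2 - y1) / (t2 - t1)  # subtract the first equation from the second
    a = y1 - b * t1            # substitute b back into the first equation
    return a, b


a, b = line_through((3, 68), (9, 110))
print(a, b)
```

Subtracting one equation from the other eliminates a, which is why only the differences in y and t enter the slope.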

If the equation to be fitted is a second-degree parabola 
$y' = a + bt + ct^2$ and if $(t_1, y_1)$, $(t_2, y_2)$, and $(t_3, y_3)$ are the 
coordinates of the selected points, then a, b, and c are determined 
by solving the equations 

$$y_1 = a + bt_1 + ct_1^2$$
$$y_2 = a + bt_2 + ct_2^2$$
$$y_3 = a + bt_3 + ct_3^2$$

Three equations are more difficult to solve than two; but if the 
time scale is chosen so that $t_1 = 0$, then these reduce to 

$$y_1 = a$$
$$y_2 = a + bt_2 + ct_2^2$$
$$y_3 = a + bt_3 + ct_3^2$$

or 

$$y_2 - y_1 = bt_2 + ct_2^2$$
$$y_3 - y_1 = bt_3 + ct_3^2$$

and two equations are obtained for determining b and c, the 
value of a being $y_1$. For example, if the selected points are 

$$(t_1 = 0,\ y_1 = 68) \quad (t_2 = 6,\ y_2 = 110) \quad (t_3 = 12,\ y_3 = 200)$$

then a = 68 and b and c may be found from the solution of the 
equations 

$$110 - 68 = 6b + 36c$$
$$200 - 68 = 12b + 144c$$

The results are b = 3 and c = 2/3 = 0.67; hence, the parabola 
which passes through the given points is 

$$y' = 68 + 3t + 0.67t^2, \quad \text{origin at } t_1 = 0$$
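
The same reduction can be sketched in Python. The sketch assumes, as in the text, that the time scale has been chosen so that the first selected point lies at t = 0; the function name and the use of determinants to solve the remaining pair of equations are choices made here for illustration.

```python
def parabola_through(p1, p2, p3):
    """Return (a, b, c) of y' = a + b*t + c*t**2 through three points.

    Assumes the origin of t is at the first selected point (t1 = 0),
    so that a = y1 and only b and c remain to be found.
    """
    (t1, y1), (t2, y2), (t3, y3) = p1, p2, p3
    if t1 != 0:
        raise ValueError("this sketch assumes the first point is at t = 0")
    a = y1
    d2, d3 = y2 - y1, y3 - y1
    # Solve  d2 = b*t2 + c*t2**2  and  d3 = b*t3 + c*t3**2  by determinants.
    det = t2 * t3 ** 2 - t3 * t2 ** 2
    if det == 0:
        raise ValueError("the selected points must have distinct t values")
    b = (d2 * t3 ** 2 - d3 * t2 ** 2) / det
    c = (t2 * d3 - t3 * d2) / det
    return a, b, c


a, b, c = parabola_through((0, 68), (6, 110), (12, 200))
print(a, b, round(c, 2))
```

For the three points of the example this reproduces a = 68, b = 3, and c = 2/3.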

When higher degree polynomials are fitted in this way, the 
simultaneous equations may be solved by repeated substitution, 

¹ These values may be actual values or values estimated as normal. 

or special methods making use of finite differences may be 
employed.¹ 

Method of Averages. Even less refined methods of fitting 
lines to data than those already described could be applied; in 
fact, the analyst could, if he so desired, merely draw the line 
that seems to fit the plotted data. The objection to this method 
is that it is too subjective — no two people would draw the same 
line. A certain degree of objectivity is secured by applying the 
method of selected points, which has already been described, or 
by using a modification of that method, namely, the method of 
averages. The method of averages merely suggests a refine- 
ment in the selection of the points. It can be illustrated by the 
fitting of a straight line, but it could be applied to curves as well. 

Work Sheet for Fitting a Straight-line Trend by the Method of 
Averages 

     t       y
     1       5
     2       2
     3       7     For t = 3, y' is taken as the average of the first five
     4       4     y's; that is, $y'_3 = 25/5 = 5$.
     5       7
     6       8
     7      15
     8      19     For t = 8, y' is taken as the average of the last five
     9      18     y's; that is, $y'_8 = 75/5 = 15$.
    10      15

The trend line is the straight line passing through the two 
points t = 3, y' = 5 and t = 8, y' = 15. Following the same 
procedure as that used in the method of selected points, the 
parameters a and b are found by solving the following two 
equations: 

$$5 = a + 3b$$
$$15 = a + 8b$$

from which it is found that b = 2 and a = -1, so that the 
trend line is y' = -1 + 2t. 
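
A short Python sketch of the method of averages follows. Splitting the series into two equal halves and using the mean t and mean y of each half as the selected points is how the work sheet above proceeds; extending that rule to any even-length series is an assumption of this sketch, and the function name is illustrative.

```python
def averages_trend(t, y):
    """Fit y' = a + b*t through the mean points of the two halves."""
    n = len(y)
    if n != len(t) or n < 2 or n % 2 != 0:
        raise ValueError("this sketch assumes an even number of paired values")
    half = n // 2
    t_first, y_first = sum(t[:half]) / half, sum(y[:half]) / half
    t_last, y_last = sum(t[half:]) / half, sum(y[half:]) / half
    b = (y_last - y_first) / (t_last - t_first)
    a = y_first - b * t_first
    return a, b


t_values = list(range(1, 11))
y_values = [5, 2, 7, 4, 7, 8, 15, 19, 18, 15]  # from the work sheet
a, b = averages_trend(t_values, y_values)
print(a, b)
```

The two half-means fall at (3, 5) and (8, 15), so the sketch gives the same line as the two equations solved in the text.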

Method of Moving Averages. Ordinarily the method of moving 
averages is used with monthly data, but it could be used with 
annual data if an appropriate number of years over which to 

¹ For the latter, the reader is referred to E. T. Whittaker and G. Robinson, 
The Calculus of Observations (1924), Chap. I. 

average or smooth the data could be determined. The difficulty 
of determining the proper number of years for the averaging 
period is one of the objections to this method; another objection 
is that it does not give an equation of trend. The method of 
moving averages is explained in Chap. XXIII, Seasonal Variation. 

Advantages of the Method of Least Squares. The advantage
of using the least-squares line is that the residuals from it add
up to zero and the sum of their squares is a minimum; this
supplies an objective criterion of the fit of the line. In addition,
the least-squares method of trend fitting is a very flexible device
that can be widely applied and varied according to the type
of line desired. If a complex trend line is desired, a mathematical
procedure based upon the least-squares criterion is readily
available. The method of orthogonal polynomials
explained in the next chapter, for example, is an application of
the method of least squares.

ILLUSTRATIONS OF RATIONAL TRENDS 

As indicated in the preceding chapter, rational trends are
likely to be logistic in character. The simplest type of logistic
curve is of the form $y = ab^t$, which may readily be reduced to a
straight line if the equation is expressed in logarithms, as follows:
$\log y = \log a + t \log b$.

Trend of a Dying Institution. If the early development,
growth, and arrival at maturity of a new economic institution
follow the pattern suggested by Raymond B. Prescott, as
explained in the preceding chapter, presumably the disappearance
of a dying institution would follow a reversal of that pattern.
Thus, it would die slowly at first, then rapidly, and then slowly
again until it finally disappeared. If such is the case, the
appropriate equation to use is one of the Verhulst, Pearl-Reed,
or Gompertz types of curves. However, an economic institution
that is disappearing from the economic system might depart in
another manner; it might be struck a sudden devastating blow
by a new development that caused it to die or decline according
to the simple logistic curve $y = ab^t$. Such appears to be the
case with respect to a certain type of commercial bank credit
known as "open-market commercial paper." Many authorities
on money and credit believe this to be a dying institution
in this country; and accordingly the downward trend illustrated
in Table 81 and Fig. 144 may be considered a rational trend.¹
The data used are annual averages of the monthly volumes of
open-market commercial paper outstanding; and Table 81 is a work
sheet for calculating the straight-line logarithmic trend line for
these data, following the method indicated on pages 566 to 568.
Here, however, the straight line is fitted to the logarithms of the
data instead of to the data themselves.

The equation for this trend line is $y' = ab^t$, so that, by the
rule of logarithms,

    $\log y' = \log a + t \log b$

The two least-squares equations that would be obtained by the
method explained above are as follows:²

    $\Sigma \log y = N \log a + \log b \, \Sigma t$
    $\Sigma t \log y = \log a \, \Sigma t + \log b \, \Sigma t^2$

Upon substituting the sums taken from the appropriate columns
of Table 81, in which $\Sigma t = 0$ because the origin is taken at
the middle year, this gives

    $36.18035 = 23 \log a$
    $-38.41844 = 1,012 \log b$

from which

    $\log a = 1.57306$ and $\log b = -0.037963$

Therefore, the equation of the best-fitting (according to the
least-squares criterion) logarithmic trend in this case is

    $\log y' = 1.57306 - 0.037963t$
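The two sums and the resulting parameters can be checked with a short Python sketch. The variable names are illustrative only. One assumption is flagged in the code: every logarithm printed in Table 81 is exactly one less than the common logarithm of the dollar figure in millions, so the sketch subtracts 1 from each logarithm in order to reproduce the printed sums; this shifts $\log a$ by a constant and leaves $\log b$ unaffected.

```python
import math

# Open-market commercial paper outstanding, 1919-1941 (Table 81), millions of dollars
y = [1084, 1113, 749, 768, 834, 873, 743, 629, 585, 494, 322, 489,
     264, 106, 95, 156, 174, 188, 296, 239, 198, 234, 317]

N = len(y)                                   # 23 years
t = [i - (N - 1) // 2 for i in range(N)]     # -11 ... 11, origin at the middle year

# Assumption: the printed logarithms equal log10(y) - 1 (see the lead-in above)
log_y = [math.log10(v) - 1 for v in y]

sum_log = sum(log_y)                                   # sum of log y
sum_t_log = sum(ti * li for ti, li in zip(t, log_y))   # sum of t log y
sum_t2 = sum(ti * ti for ti in t)                      # sum of t squared

# Because sum(t) = 0, the two least-squares equations separate
log_a = sum_log / N
log_b = sum_t_log / sum_t2

print(f"log y' = {log_a:.5f} + ({log_b:.6f})t")
```

Since $\Sigma t = 0$, each normal equation contains only one unknown, which is why no simultaneous solution is needed here.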

When a logarithmic straight line is fitted to a time series by
the method of least squares, it is the sum of the squares of the
ratio residuals that is made a minimum, and not the sum of the
squares of the actual residuals as is the case where an
arithmetical straight line is fitted.³ It is the following expression
that is minimized:

    $\Sigma(\log y - \log y')^2$

which is the same as

    $\Sigma\left(\log \frac{y}{y'}\right)^2$

If the logarithm is expanded in a power series, this sum is seen
to be roughly equivalent, apart from a constant factor that
depends on the base of the logarithms, to

    $\Sigma\left(\frac{y - y'}{y'}\right)^2$

Table 81. Work Sheet for Calculating Annual Index of Normal and Trend
Straight-line logarithmic trend
Data: Open-market commercial paper outstanding. Annual averages of monthly data
(In millions of dollars)
Equation of trend: $\log y' = 1.57306 - 0.037963t$

    Year       t         y        log y
    1919     -11     1,084      2.03503
    1920     -10     1,113      2.04650
    1921      -9       749      1.87448
    1922      -8       768      1.88536
    1923      -7       834      1.92117
    1924      -6       873      1.94101
    1925      -5       743      1.87099
    1926      -4       629      1.79865
    1927      -3       585      1.76716
    1928      -2       494      1.69373
    1929      -1       322      1.50786
    1930       0       489      1.68931
    1931       1       264      1.42160
    1932       2       106      1.02531
    1933       3        95      0.97772
    1934       4       156      1.19312
    1935       5       174      1.24055
    1936       6       188      1.27416
    1937       7       296      1.47129
    1938       8       239      1.37840
    1939       9       198      1.29667
    1940      10       234      1.36922
    1941      11       317      1.50106

    $N = 23$    $\Sigma \log y = 36.18035$    $\Sigma t \log y = -38.41844$    $\Sigma t^2 = 1,012$

Source: Compiled from the Annual Report of the Federal Reserve Board, 1929, p. 121;
1935, p. 174; and from the Survey of Current Business, Annual Supplement (Vol. 20, 1940),
p. 47.

Fig. 144. Open-market commercial paper outstanding in the United States,
1919-1941. Logistic trend fitted by method of least squares.

For a dying institution, open-market commercial paper
outstanding showed remarkable vigor in the years 1933-1941, and
perhaps the monetary economists were premature in their
predictions. Whether or not they were is a matter for the
future to reveal.

¹ For explanations of the demise of open-market commercial paper see
B. H. Beckhart, The New York Money Market, Vol. 3, pp. 242-246; O. A.
Greef, The Commercial Paper House in the United States (1938), pp. 123-127;
P. Hunt, Portfolio Policies of Banks in the United States 1920-1929 (1940),
pp. 11-38.

² See pp. 566-567.

³ Cf. pp. 566-567.
Trend of a Growing Institution. Method of Selected Points
Illustrated. If the hypothesis made by Raymond B. Prescott
can be demonstrated or illustrated in real life, it should surely be
done by the development of the automobile during the past
three or four decades. Table 82 and Fig. 145 give an illustration
of the fitting of a rational trend that purports to represent this
type of growth, constituting thereby a test of this hypothesis.¹
They also illustrate the method of fitting a logistic curve of the
Pearl-Reed type by using selected points.

The equation of the curve may be written in the form

    $y = \frac{k}{1 + m}$                                   (8)

in which $m = e^{a+bt}$.

It is thus required to find three parameters $k$, $a$, and $b$, which
is more conveniently done by first converting the equation into
logarithms, as indicated in the work sheet.

By using annual data, consisting of monthly average output
of passenger cars and trucks each year from 1903 to 1941, a
graph was made and from its examination the following selected
points were adopted:

    1909            1922            1935
    $t_0 = 0$       $t_1 = 13$      $t_2 = 26$
    $y_0 = 10$      $y_1 = 250$     $y_2 = 320$

The values of the parameters $k$, $a$, and $b$ may be found by using
the following equations:²

    $k = \frac{2y_0y_1y_2 - y_1^2(y_0 + y_2)}{y_0y_2 - y_1^2}$

    $a = \log_e \frac{k - y_0}{y_0}$                                        (9)

    $b = \frac{1}{n} \log_e \frac{y_0(k - y_1)}{y_1(k - y_0)}$

in which $n$ is defined as $t_1 - t_0$.


¹ Explained on pp. 553-555.

² Cf. Pearl, Raymond, Studies in Human Biology (1924), Chap. XXIV;
The Biology of Population Growth (1925), p. 22. Citations used from
F. E. Croxton and D. J. Cowden, Applied General Statistics (1939), pp. 444-
445, 852-863.
Thus, for the problem illustrated,

    $k = \frac{2 \times 10 \times 250 \times 320 - 250 \times 250 \times 330}{10 \times 320 - 250 \times 250}$

      $= \frac{5(128 - 1,650)}{1.28 - 25} = \frac{-7,610}{-23.72}$

      $= 320.82630$

    $a = \log_e \frac{320.82630 - 10}{10}$

      $= \log_e 31.082630$
      $= 2.302585 \log_{10} 31.082630$
      $= 2.302585(1.4925178)$
      $= 3.4366491$

    $b = \frac{1}{13} \log_e \frac{10(320.82630 - 250)}{250(320.82630 - 10)} = \frac{2.302585}{13} \log_{10} \frac{70.82630}{7,770.6575}$

      $= 0.1771219 (\log_{10} 0.00911458)$        [$\log_{10} 0.00911458 = 7.9597368 - 10$]
      $= 0.1771219(-2.0402632)$
      $= -0.3613753$
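The three parameters can be recomputed directly from Eqs. (9). The following Python sketch is an illustration only; the function name `pearl_reed_selected_points` is hypothetical, and the routine assumes, as in the example, three equally spaced selected points with the first at $t = 0$.

```python
import math


def pearl_reed_selected_points(y0, y1, y2, n):
    """Parameters k, a, b of y = k / (1 + e**(a + b*t)) from three selected
    points taken at t = 0, n, and 2n (Eqs. 9)."""
    k = (2 * y0 * y1 * y2 - y1 ** 2 * (y0 + y2)) / (y0 * y2 - y1 ** 2)
    a = math.log((k - y0) / y0)                      # natural logarithm
    b = (1 / n) * math.log(y0 * (k - y1) / (y1 * (k - y0)))
    return k, a, b


# Selected points of the example: 1909, 1922, and 1935
k, a, b = pearl_reed_selected_points(10, 250, 320, 13)
print(f"k = {k:.5f}, a = {a:.7f}, b = {b:.7f}")
```

Working in natural logarithms throughout avoids the conversion factor 2.302585 that the hand computation needs when common-logarithm tables are used.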

As indicated in Table 82, the values of $m$ for various values of
$t$ are conveniently found by the use of logarithms; thus, since

    $m = e^{a+bt}$

    $\log m = \log_{10} e \, (a + bt)$        [since $\log_{10} e = 0.43429$]
           $= 0.43429(a + bt)$

or, for the example illustrated,

    $\log m = 0.43429(3.4366491 - 0.3613753t)$
           $= 1.4925023 - 0.1569417t$

For the year 1909, when $t = 0$, the value of $\log m$ is 1.4925023,
as may be seen from the work sheet (Table 82), and the values
of $\log m$ for other values of $t$ are obtained by the successive
algebraic subtraction of the constant -0.1569417 through the
years preceding 1909 and by successive algebraic addition of the
constant -0.1569417 through the years subsequent to 1909.
These are the logarithms of $m$ for the various values of $t$, that
is, for the various years. In the next column of the work sheet,
the antilogarithms are entered, which, when added to 1, are
divided into the constant $k$ in order to find the trend values for
each year. These steps are shown in the next three columns of
Table 82. An index of normal, that is, $y/y'$, is also calculated.
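The steps just described (find $m$, add 1, divide into $k$, then form the index of normal) can be sketched in Python as follows. The constants are the parameter values computed in the text; the function name `trend` and the choice of the three selected years as sample rows are assumptions made for the illustration. The sketch evaluates $e^{a+bt}$ directly rather than through common logarithms, so the last digit may differ slightly from a work sheet built with five-place tables.

```python
import math

# Parameters of the Pearl-Reed curve as computed in the text
K, A, B = 320.82630, 3.4366491, -0.3613753
ORIGIN = 1909                      # year for which t = 0


def trend(t):
    """Trend value y' = k / (1 + m), with m = e**(a + b*t)."""
    m = math.exp(A + B * t)
    return K / (1 + m)


# Raw data for the three selected years (thousands of cars, from Table 82)
production = {1909: 10.9, 1922: 212.0, 1935: 328.9}

for year, y in production.items():
    t = year - ORIGIN
    y_trend = trend(t)
    index = 100 * y / y_trend      # index of normal, as a percentage
    print(f"{year}  t = {t:2d}  y' = {y_trend:8.3f}  y = {y:6.1f}  index = {index:6.1f}")
```

At the three selected points the curve returns the selected values 10, 250, and 320, which is a convenient check that the parameters were computed correctly.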
Table 82. Work Sheet for Calculating Index of Normal and Trend
Logistic trend of the Pearl-Reed type
Data: Automobile production in the United States. Annual averages of monthly data
(In thousands of cars)

    Year     t        log m             m        1 + m    y' = k/(1 + m)*       y       y/y'
    1903    -6     2.4341525     271.7393     272.7393          1.176         0.9       76.5
    1904    -5     2.2772108     189.3261     190.3261          1.686         1.9      112.7
    1905    -4     2.1202691     131.9073     132.9073          2.414         2.1       87.0
    1906    -3     1.9633274     91.90234     92.90234          3.453         2.8       81.1
    1907    -2     1.8063857     64.03030      65.0303          4.933         3.7       75.0
    1908    -1     1.6494440     44.61122     45.61122          7.034         5.4       76.8
    1909     0     1.4925023     31.08152     32.08152         10.000        10.9      109.0
    1910     1     1.3355606     21.65515     22.65515         14.161        15.6      110.2
    1911     2     1.1786189     15.08757      16.0876         19.942        17.5       87.8
    1912     3     1.0216772     10.51177      11.5118         27.869        31.5      113.0
    1913     4     0.8647355      7.32380       8.3238         38.543        40.4      104.8
    1914     5     0.7077938      5.10263       6.1026         52.572        47.4       90.2
    1915     6     0.5508521      3.55510       4.5551         70.432        80.8      114.7
    1916     7     0.3939104      2.47691       3.4769         92.273       134.8      146.1
    1917     8     0.2369687      1.72571       2.7257        117.704       156.2      132.7
    1918     9     0.0800270      1.20234       2.2023        145.675        97.6       67.0
    1919    10    -0.0769147      0.83769       1.8377        174.581       161.1       92.3
    1920    11    -0.2338564      0.58364       1.5836        202.588       185.6       91.6
    1921    12    -0.3907981      0.40663       1.4066        228.082       134.7       59.0
    1922    13    -0.5477398      0.28331       1.2833        249.999       212.0       84.8
    1923    14    -0.7046815      0.19739       1.1974        267.938       336.2      125.5
    1924    15    -0.8616232      0.13752       1.1375        282.040       300.2      106.4
    1925    16    -1.0185649     0.095815       1.0958        292.773       355.5      121.4
    1926    17    -1.1755066     0.066756       1.0668        300.748       358.4      119.2
    1927    18    -1.3324483     0.046511       1.0465        306.568       283.4       92.4
    1928    19    -1.4893900     0.032405       1.0324        310.758       363.2      116.9
    1929    20    -1.6463317     0.022577       1.0226        313.742       446.5      142.3
    1930    21    -1.8032734     0.015730       1.0157        315.858       279.7       88.6
    1931    22    -1.9602151     0.010959       1.0110        317.348       199.1       62.7
    1932    23    -2.1171568     0.007636       1.0076        318.394       114.2       35.9
    1933    24    -2.2740985    0.0053199       1.0053        319.128       160.0       50.1
    1934    25    -2.4310402    0.0037065       1.0037        319.640       229.4       71.8
    1935    26    -2.5879819    0.0025824      1.00258        320.001       328.9      102.8
    1936    27    -2.7449236    0.0017992      1.00180        320.250       371.2      115.9
    1937    28    -2.9018653    0.0012535      1.00125        320.426       400.7      125.0
    1938    29    -3.0588070    0.0008734      1.00087        320.547       207.4       64.7
    1939    30    -3.2157487    0.0006085      1.00061        320.631       298.1       93.0
    1940    31    -3.3726904    0.0004239      1.00042        320.692       372.4      116.1
    1941    32    -3.5296321    0.0002954      1.00030        320.730       403.2      125.7

    $\Sigma \frac{y}{y'} = 3,788.7$†

Source: Data from Statistical Abstract of the United States, 1933, p. 334; and Survey of
Current Business, Annual Supplement, Vol. 12 (1932), and current issues, passim.

* $k = 320.82630$. See p. 579.

† If the curve had been fitted according to the least-squares criterion, this sum would
approximate a hundred times the number of years, that is, 3,900.
The results lend support to the hypothesis that automobile
production in the United States had a growth during those years
following the law of the Pearl-Reed logistic curve. The goodness
of fit of the trend is attested to, not only by the plotting of the
curve with the data in Fig. 145, but also by the fact that the
sum of the ratios of the raw data to the trend equals
approximately a hundred times the number of years.

Fig. 145. Automobile production in the United States, 1903-1941. Pearl-Reed
curve fitted by method of selected points.
ILLUSTRATIONS OF EMPIRICAL TRENDS

The distinction between rational trends and empirical trends 
lies, not in the method of calculation, but in the interpretation 
and analytical use of the trend after it is calculated. Yet, in the 
case of empirical trends, it frequently suffices to fit a trend line 
of very simple character. Thus a straight line may be quite 
adequate in some cases. 

Straight-line Trend. Table 83 contains a work sheet for
calculating a straight-line trend in open-market commercial
paper outstanding for the period 1931-1941. Rationalization
of this trend is uncertain: it may be the commencement of a new
period of growth in what was supposed to be a dying institution,
or it may be merely a cyclical movement. At any rate, for the
period of 11 years selected the trend analysis makes possible a
better study of the shorter term cyclical or residual movements in
the data.

Table 83. Work Sheet for Calculating Index of Normal and Trend
Straight line
Data: Open-market commercial paper outstanding. Annual averages of monthly data
(In millions of dollars)
Equation of trend: $y' = 206 + 12.49t$ (origin at 1936)
Source: Annual Report of the Federal Reserve Board, 1929, p. 121;
1935, p. 174. Survey of Current Business, Annual Supplement, Vol. 20
(1940), p. 47 and current issues, passim.

    (1)       (2)       (3)     (4)       (5)       (6)          (7)
    Year    Raw data     t      t^2       ty       Trend     Index of computed
               y                                     y'         trend, y/y'
    1931      264       -5      25     -1,320       144          183.3
    1932      106       -4      16       -424       156           67.9
    1933       95       -3       9       -285       168           56.5
    1934      156       -2       4       -312       181           86.2
    1935      174       -1       1       -174       194           89.7
    1936      188        0       0          0       206           91.3
    1937      296        1       1        296       218          135.8
    1938      239        2       4        478       231          103.5
    1939      198        3       9        594       243           81.5
    1940      234        4      16        936       256           91.4
    1941      317        5      25      1,585       268          118.3

    Total   2,267        0     110      1,374                  1,105.4*
            $\Sigma y$  $\Sigma t$  $\Sigma t^2$  $\Sigma ty$
    $N = 11$

* This total is a cross check on the work sheet; it should equal a hundred times the number
of years. Failure to check precisely is due to rounding.

The work sheet contains all the information necessary to
calculate the equation of trend, which in this case is of the simple
form $y' = a + bt$. As seen in Eq. (5), the equation is found
by the following:

    $a = \frac{\Sigma y}{N}$          $b = \frac{\Sigma ty}{\Sigma t^2}$

From the work sheet, for this particular problem,

    $\Sigma y = 2,267$    $\Sigma t = 0$    $\Sigma t^2 = 110$    $\Sigma ty = 1,374$    $N = 11$

Accordingly,

    $a = \frac{2,267}{11} = 206$          $b = \frac{1,374}{110} = 12.49$

and the equation of trend is $y' = 206 + 12.49t$ (origin at 1936).
It is necessary to specify the origin in order to know for which
year $t = 0$. If the origin were 1931, the equation would be
$y' = 144 + 12.49t$ (origin at 1931); this equation describes the
same straight line as $y' = 206 + 12.49t$ (origin at 1936).

Column (6) of the work sheet contains the solutions of the
trend equation for the respective values of $t$. Thus, for 1933,
$t = -3$, and the solution of the trend equation for that year is
$y' = 206 + (-3)(12.49) = 168$.

Column (7) is the index of computed trend, each $y$ of the raw
data divided by the corresponding $y'$ of the trend, and the result
expressed as a percentage. Thus, 264 is 183.3 per cent of 144.
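The work-sheet totals and the two parameters can be verified with the short Python sketch below. The variable names are illustrative, and the simplified formulas used are valid only under the stated condition that the origin is the middle year, so that $\Sigma t = 0$.

```python
# Open-market commercial paper outstanding, 1931-1941 (Table 83), millions of dollars
years = list(range(1931, 1942))
y = [264, 106, 95, 156, 174, 188, 296, 239, 198, 234, 317]

N = len(y)
origin = years[N // 2]                  # middle year, 1936
t = [year - origin for year in years]   # -5 ... 5, so that sum(t) = 0

sum_y = sum(y)
sum_ty = sum(ti * yi for ti, yi in zip(t, y))
sum_t2 = sum(ti * ti for ti in t)

# With sum(t) = 0 the least-squares equations reduce to two simple quotients
a = sum_y / N
b = sum_ty / sum_t2

print(f"y' = {a:.2f} + {b:.2f}t (origin at {origin})")

# Shifting the origin to 1931 changes only the constant term
a_1931 = a + b * (1931 - origin)
print(f"y' = {a_1931:.2f} + {b:.2f}t (origin at 1931)")
```

The second line printed shows the point made in the text: moving the origin alters the constant term but not the slope, and both equations describe the same straight line.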

Polynomial Empirical Trends. Laborsaving Devices. It is
possible to find a second-degree, third-degree, or higher degree
polynomial trend by the methods already illustrated. To fit a
second-degree polynomial, according to the least-squares criterion,
the work sheet would be like that illustrated in skeleton form
on page 570. But it is better, for practical use, to introduce two
important sets of laborsaving devices before proceeding to fit
the higher order polynomial trends. The first set of laborsaving
devices has to do with economy of calculation in the work sheet;
the second set has to do with solving the equations for different
values of $t$ and, therefore, with computing trend values for various
years.

584                    STUDY OF DYNAMIC VARIABILITY

Economy of Calculation in the Work Sheet. As already noted,
an economy was obtained by taking an odd number of years and
making the median year the origin, so that $\Sigma t = 0$, $\Sigma t^3 = 0$, and
similarly the total of all odd powers of $t$ will be equal to zero;
hence, columns in the work sheet for odd powers of $t$ are not
required. In addition, the entry of columns in the work sheet
for the even powers of $t$ may be avoided because $\Sigma t^2$, $\Sigma t^4$, $\Sigma t^6$,
etc., can be computed from formulas. It can be shown by
algebraic derivation¹ that, if $t$ runs integrally from $t = \pm 1$ to
$t = \pm(n - 1)$, in which $n = \frac{N + 1}{2}$ = (terminal value) + 1,

    $\Sigma t^2 = \frac{n(n - 1)(2n - 1)}{3}$

    $\Sigma t^4 = \Sigma t^2 \, \frac{3n^2 - 3n - 1}{5}$                              (10)

    $\Sigma t^6 = \Sigma t^2 \, \frac{3n^2(n^2 - 2n) + 3n + 1}{7}$

By similar algebraic computation $\Sigma t^8$ can be evaluated in terms
of $n$, but it is preferable to use orthogonal polynomials if a trend
equation of fourth or higher degree is sought.
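As a check on Eqs. (10), the following Python sketch compares the formulas with sums obtained by direct addition. The function names are hypothetical and the code is an illustration only; it assumes, as the text does, an odd number of years $N$ with the origin at the median year.

```python
def power_sums_by_formula(N):
    """Sums of t**2, t**4, t**6 for t = -(n-1) ... (n-1), n = (N+1)/2 (Eqs. 10)."""
    n = (N + 1) // 2
    s2 = n * (n - 1) * (2 * n - 1) // 3
    s4 = s2 * (3 * n * n - 3 * n - 1) // 5
    s6 = s2 * (3 * n * n * (n * n - 2 * n) + 3 * n + 1) // 7
    return s2, s4, s6


def power_sums_direct(N):
    """The same sums obtained by adding the powers term by term."""
    half = (N - 1) // 2
    ts = range(-half, half + 1)
    return (sum(t ** 2 for t in ts),
            sum(t ** 4 for t in ts),
            sum(t ** 6 for t in ts))


for N in (3, 5, 11, 13, 23):
    print(N, power_sums_by_formula(N), power_sums_direct(N))
```

The integer divisions are exact because each quotient is itself a sum of whole numbers.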

A second economy for the work sheet is secured by using a
subtotal summation procedure by which aggregates $S_1$, $S_2$, $S_3$,
$S_4$, etc., are obtained. From these aggregates algebraic formulas
are used to compute the required sums as follows:

    $\Sigma y = S_1$
    $\Sigma ty = nS_1 - S_2$
    $\Sigma t^2y = n^2S_1 - (2n + 1)S_2 + 2S_3$                                    (11)
    $\Sigma t^3y = n^3S_1 - (3n^2 + 3n + 1)S_2 + 6(n + 1)S_3 - 6S_4$

in which $n = \frac{N + 1}{2}$.

¹ Cf. Ross, Frank A., "Formulae for Facilitating Computation in Time
Series Analysis," Journal of the American Statistical Association, Vol. 20
(1925), pp. 75-79. For method of proof see footnote 1, p. 586.
Table 84. Economical Work Sheet for Calculating Polynomial
Trend, Algebraic Illustration
Method of least squares

           Year            Data                                    Sets of subtotals
    Year    T      t        y       First                          Second                               Third
    (1)    (2)    (3)      (4)      (5)                            (6)                                  (7)
    1937    1     -2       y_1      y_1                            y_1                                  y_1
    1938    2     -1       y_2      y_1 + y_2                      2y_1 + y_2                           3y_1 + y_2
    1939    3      0       y_3      y_1 + y_2 + y_3                3y_1 + 2y_2 + y_3                    6y_1 + 3y_2 + y_3
    1940    4      1       y_4      y_1 + y_2 + y_3 + y_4          4y_1 + 3y_2 + 2y_3 + y_4             10y_1 + 6y_2 + 3y_3 + y_4
    1941    5      2       y_5      y_1 + y_2 + y_3 + y_4 + y_5    5y_1 + 4y_2 + 3y_3 + 2y_4 + y_5      15y_1 + 10y_2 + 6y_3 + 3y_4 + y_5

    Total                  S_1      S_2                            S_3                                  S_4

The subtotal summation process is illustrated algebraically in
Table 84 and arithmetically in Table 85. The sum of column
(4), $S_1$, is merely $\Sigma y$. Column (5) contains the first set of
subtotals, which is obtained on the adding machine by taking a
subtotal after entry of each item in column (4); the first subtotal
in column (5) will thus be the first item of column (4), therefore,
$y_1$; the second subtotal in column (5) will be $y_1 + y_2$; the third
subtotal will be $y_1 + y_2 + y_3$; etc. $S_2$ is the sum of these
subtotals.

The second set of subtotals, column (6), consists of subtotals
of the figures in the preceding column, column (5); thus the first
subtotal in column (6) is $y_1$, the second subtotal is $2y_1 + y_2$, the
third subtotal is $3y_1 + 2y_2 + y_3$, etc. $S_3$ is the sum of the
second set of subtotals.

The third set of subtotals, column (7), consists of the subtotals
of figures in column (6); and $S_4$ is the sum of this third set of
subtotals.

This process of taking subtotals and aggregating the subtotals
by columns to obtain $S_2$, $S_3$, $S_4$ can be repeated as many times as
desired, depending on how high a degree of polynomial is to be
fitted. If carried as far as $S_4$, a third-degree polynomial can be
fitted.

A cross check on the work sheet is noted in Table 85: $S_1$ is
equal to the final subtotal in column (5), $S_2$ is equal to the final
subtotal in column (6), $S_3$ is equal to the final subtotal in column
(7), etc.
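The subtotal procedure is easy to mimic with running sums. The Python sketch below is an illustration only: the function names `subtotal_aggregates` and `moments_from_aggregates` are hypothetical, and the data are the five items of the arithmetical illustration (Table 85). It forms $S_1$ to $S_4$ by repeated running totals and then applies Eqs. (11), comparing the results with sums obtained directly.

```python
from itertools import accumulate


def subtotal_aggregates(y, order):
    """Return [S1, S2, ..., S_order] by repeated running subtotals."""
    aggregates = []
    column = list(y)
    for _ in range(order):
        aggregates.append(sum(column))      # total of the current column
        column = list(accumulate(column))   # next set of subtotals
    return aggregates


def moments_from_aggregates(S, N):
    """Apply Eqs. (11); N must be odd and t is measured from the median year."""
    n = (N + 1) // 2
    S1, S2, S3, S4 = S
    return {
        "sum_y": S1,
        "sum_ty": n * S1 - S2,
        "sum_t2y": n ** 2 * S1 - (2 * n + 1) * S2 + 2 * S3,
        "sum_t3y": n ** 3 * S1 - (3 * n ** 2 + 3 * n + 1) * S2
                   + 6 * (n + 1) * S3 - 6 * S4,
    }


y = [2, 5, 8, 7, 9]                          # data of Table 85
S = subtotal_aggregates(y, 4)
moments = moments_from_aggregates(S, len(y))

# Direct computation for comparison
t = range(-2, 3)
direct = {
    "sum_y": sum(y),
    "sum_ty": sum(ti * yi for ti, yi in zip(t, y)),
    "sum_t2y": sum(ti ** 2 * yi for ti, yi in zip(t, y)),
    "sum_t3y": sum(ti ** 3 * yi for ti, yi in zip(t, y)),
}
print(S, moments, direct)
```

On an adding machine the saving is that no multiplication is required until the aggregates are combined at the end.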
Table 85. Economical Work Sheet for Calculating Polynomial
Trend, Arithmetical Illustration
(Method of least squares)

           Year            Data        Sets of subtotals
    Year    T      t        y       First    Second    Third
    (1)    (2)    (3)      (4)       (5)       (6)      (7)
    1937    1     -2        2         2         2        2
    1938    2     -1        5         7         9       11
    1939    3      0        8        15        24       35
    1940    4      1        7        22        46       81
    1941    5      2        9        31        77      158

    Total                  31        77       158      287
                           S_1       S_2      S_3      S_4

From Table 84 it can be readily seen that algebraically

    $S_1 = y_1 + y_2 + y_3 + \cdots + y_N$

    $S_2 = Ny_1 + (N - 1)y_2 + (N - 2)y_3 + \cdots + y_N$                       (12)

    $S_3 = \frac{N(N + 1)}{2}y_1 + \frac{(N - 1)N}{2}y_2 + \frac{(N - 2)(N - 1)}{2}y_3 + \cdots + y_N$

For the coefficient of $y_1$ in this sum is equal to the sum of the natural
numbers from 1 to $N$, therefore, $\Sigma N$, which equals $\frac{N(N + 1)}{2}$;¹ the
coefficient of $y_2$ is the sum of the natural numbers from 1 to $N - 1$, which
equals $\frac{(N - 1)N}{2}$; etc.

    $S_4 = \frac{N(N + 1)(N + 2)}{6}y_1 + \frac{(N - 1)N(N + 1)}{6}y_2 + \cdots + y_N$

For the coefficient of $y_1$ is the sum of $\frac{N(N + 1)}{2}$ as $N$ goes from 1 to $N$;
this sum equals $\frac{N(N + 1)(N + 2)}{6}$; etc.

It will be convenient to express these sums in the following
manner:

    $S_1 = \Sigma y$

    $S_2 = \Sigma (N + 1 - T)y$

        In the case of $y_1$, $N + 1 - T = N$.
        In the case of $y_2$, $N + 1 - T = N - 1$.
        In the case of $y_3$, $N + 1 - T = N - 2$. Etc.

    $S_3 = \Sigma \frac{(N + 1 - T)(N + 2 - T)}{2} y$                           (13)

        In the case of $y_1$, $\frac{(N + 1 - T)(N + 2 - T)}{2} = \frac{N(N + 1)}{2}$.
        In the case of $y_2$, $\frac{(N + 1 - T)(N + 2 - T)}{2} = \frac{(N - 1)N}{2}$. Etc.

    $S_4 = \Sigma \frac{(N + 1 - T)(N + 2 - T)(N + 3 - T)}{6} y$

        In the case of $y_1$, $\frac{(N + 1 - T)(N + 2 - T)(N + 3 - T)}{6} = \frac{N(N + 1)(N + 2)}{6}$.
        In the case of $y_2$, $\frac{(N + 1 - T)(N + 2 - T)(N + 3 - T)}{6} = \frac{(N - 1)N(N + 1)}{6}$.

In the above equations, if $T$ is replaced by $t + \bar{T} = t + \frac{N + 1}{2}$
and if, by definition, $n = \frac{N + 1}{2}$, these equations become

    $S_1 = \Sigma y$
    $S_2 = n\Sigma y - \Sigma ty$
    $2S_3 = \Sigma (n - t)^2 y + n\Sigma y - \Sigma ty$                          (14)
    $6S_4 = \Sigma (n - t)^3 y + 3\Sigma (n - t)^2 y + 2n\Sigma y - 2\Sigma ty$

in which the unmarked $\Sigma$ refers to summations with respect to
$t$ from $-\frac{N - 1}{2}$ to $+\frac{N - 1}{2}$. If these equations are expanded
and similar terms assembled, Eqs. (11) are obtained.

¹ This may be demonstrated as follows:
    $\Sigma N = 1 + 2 + 3 + 4 + \cdots + N$
Also,
    $\Sigma N = N + (N - 1) + (N - 2) + (N - 3) + \cdots + 1$
By adding,
    $2\Sigma N = (N + 1) + (N + 1) + (N + 1) + (N + 1) + \cdots + (N + 1)$
             $= N(N + 1)$
and hence
    $\Sigma N = \frac{N(N + 1)}{2}$
Table 86. Work Sheet for Calculating Trend: Second-degree Polynomial
Method of least squares
Data: Consumer expenditures for personal appearance and comfort.
Annual data in millions of dollars
Source of Data: Survey of Current Business, October, 1942, p. 24.

                                  Sets of subtotals
    Year      t        y         First       Second
    1929     -6       655          655          655
    1930     -5       630        1,285        1,940
    1931     -4       540        1,825        3,765
    1932     -3       427        2,252        6,017
    1933     -2       347        2,599        8,616
    1934     -1       393        2,992       11,608
    1935      0       441        3,433       15,041
    1936      1       503        3,936       18,977
    1937      2       545        4,481       23,458
    1938      3       543        5,024       28,482
    1939      4       540        5,564       34,046
    1940      5       568        6,132       40,178
    1941      6       653        6,785       46,963

    Total           6,785       46,963      239,746
                     S_1          S_2          S_3

By using Eqs. (11), page 584, the following values are obtained:

    $\Sigma y = 6,785$
    $\Sigma ty = 7(6,785) - 46,963 = 532$
    $\Sigma t^2y = 49(6,785) - 15(46,963) + 2(239,746) = 107,512$

By using Eqs. (10) the following values are obtained:

    $\Sigma t^2 = 182$
    $\Sigma t^4 = 182 \, \frac{3(49) - 3(7) - 1}{5} = 182(25) = 4,550$

To find the second-degree polynomial trend equation these values
may be substituted in Eqs. (7), (i) to (iii), page 570, as follows:

    (i)      $6,785 = 13a + 182c$
    (ii)     $532 = 182b$                      $b = 2.923$
    (iii)    $107,512 = 182a + 4,550c$
    (i')     $94,990 = 182a + 2,548c$          (i) $\times$ 14
             $12,522 = 2,002c$                 $c = 6.2547$

    (iii)    $107,512 = 182a + 4,550c$
    (i'')    $169,625 = 325a + 4,550c$         (i) $\times$ 25
             $62,113 = 143a$                   $a = 434$

Accordingly, the second-degree polynomial equation of trend
for the problem illustrated is $y' = 434 + 2.923t + 6.2547t^2$
(origin at 1935).
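The elimination above can be confirmed with a short Python sketch. The variable names are illustrative, and the sums are formed by direct addition rather than by the subtotal method, so the sketch also serves as an independent check on the work sheet; the data are those of Table 86.

```python
# Consumer expenditures for personal appearance and comfort, 1929-1941 (Table 86)
y = [655, 630, 540, 427, 347, 393, 441, 503, 545, 543, 540, 568, 653]
N = len(y)
t = list(range(-(N // 2), N // 2 + 1))      # -6 ... 6, origin at 1935

sum_y = sum(y)
sum_ty = sum(ti * yi for ti, yi in zip(t, y))
sum_t2y = sum(ti ** 2 * yi for ti, yi in zip(t, y))
sum_t2 = sum(ti ** 2 for ti in t)
sum_t4 = sum(ti ** 4 for ti in t)

# Normal equations with the odd-power sums equal to zero:
#   (i)   sum_y   = N*a      + sum_t2*c
#   (ii)  sum_ty  = sum_t2*b
#   (iii) sum_t2y = sum_t2*a + sum_t4*c
b = sum_ty / sum_t2
det = N * sum_t4 - sum_t2 ** 2
a = (sum_y * sum_t4 - sum_t2 * sum_t2y) / det
c = (N * sum_t2y - sum_t2 * sum_y) / det

print(f"y' = {a:.1f} + {b:.3f}t + {c:.4f}t^2 (origin at 1935)")
```

Equation (ii) stands alone because the sums of the odd powers of $t$ vanish, so only (i) and (iii) need to be solved simultaneously.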

Finding Trend Values by Method of Finite Differences. The
equation for a trend line having been found, the problem is to
compute from this equation the values of $y'$ pertaining to a given
set of years. Direct substitution is laborious. Finite differences
provide an easier method. The keystone of the latter
method is the fact that the $n$th difference of a polynomial of the
$n$th degree is constant. Hence, that constant $n$th difference
having been determined, the other differences, and ultimately
the desired trend values themselves, can all be computed by
merely reversing the differencing process, i.e., by simple addition.

In the equation $y' = a + bt + ct^2$ the first difference, by
definition, would be

    $\Delta^1 y' = a + b(t + 1) + c(t + 1)^2 - a - bt - ct^2 = b + 2ct + c$

and, by definition, the second difference would be

    $\Delta^2 y' = b + 2c(t + 1) + c - b - 2ct - c = 2c$


Table 87. Building Up a Polynomial by Finite Differences

    (1)        (2)              (3)              (4)              (5)              (6)
     t        Fourth           Third            Second           First         Polynomial
            difference       difference       difference       difference    (trend) values
           $\Delta^4 y'$    $\Delta^3 y'$    $\Delta^2 y'$    $\Delta^1 y'$       $y'$
    -4         -1                6               -48              220              305
    -3         -1                5               -42              172              525
    -2         -1                4               -37              130              697
    -1         -1                3               -33               93              827
     0         -1                2               -30               60              920
     1                           1               -28               30              980
     2                                           -27                2            1,010
     3                                                            -25            1,012
     4                                                                             987
Table 87 illustrates the building up of a polynomial by finite
differences. The polynomial here is of the fourth degree, and
hence its fourth differences are all identical.

A figure in a given line of column (6) algebraically added to
the figure in the same line of column (5) gives the next figure for
column (6); thus, 305 + 220 = 525, 525 + 172 = 697, etc.
Similarly, a figure in a given line of column (5) added algebraically
to the figure in the same line of column (4) gives the next figure
for column (5); thus, 220 - 48 = 172, 172 - 42 = 130, etc.
The same general rule applies to the figures in columns (2)
and (3); thus, -48 + 6 = -42, -42 + 5 = -37, etc., and
6 - 1 = 5, 5 - 1 = 4, etc.

In the polynomial illustrated in Table 87, the polynomial is
known to have a constant fourth difference. Hence, if the
polynomial value and the differences of any one line are all
known, then the differences and polynomial values for all other
lines above or below the given line can be readily computed.
Thus, if for $t = 0$, it is known that the polynomial value $y' = 920$,
$\Delta^1 y'_0 = 60$, $\Delta^2 y'_0 = -30$, $\Delta^3 y'_0 = 2$, and the constant fourth
difference is equal to -1, then, by working from right to left and
up and down, the other values in the table can be built up.
The first set of variable differences, in this case the third, can be
built up cumulatively from the known $\Delta^3 y'_0 = 2$ and the
constant difference -1. It is to be noted that in a downward
direction, in this table, this constant difference is -1; so in
building up from the bottom to the top the constant difference is
algebraically -(-1), or +1. This rule follows also for the
building up of the other differences.
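The building-up process can be expressed as a short routine. The Python sketch below is an illustration under stated assumptions: the function name `build_polynomial` is hypothetical, and the routine takes the polynomial value and all the differences for the line $t = 0$, the last of which is the constant difference. It works downward by addition and upward by subtraction, as described above.

```python
def build_polynomial(y0, diffs0, t_min, t_max):
    """Build polynomial values for t_min..t_max from the value y0 and the
    differences [first, second, ..., constant] known on the line t = 0."""
    values = {0: y0}

    # Downward (later lines): each figure plus the figure of the next higher
    # order on the same line gives the next figure in its own column.
    state = [y0] + list(diffs0)
    for t in range(1, t_max + 1):
        state = [state[i] + state[i + 1] for i in range(len(state) - 1)] + [state[-1]]
        values[t] = state[0]

    # Upward (earlier lines): reverse the rule, starting from the constant
    # difference and working toward the polynomial value.
    state = [y0] + list(diffs0)
    for t in range(-1, t_min - 1, -1):
        previous = state[:]
        for i in range(len(state) - 2, -1, -1):
            previous[i] = state[i] - previous[i + 1]
        state = previous
        values[t] = state[0]

    return [values[t] for t in range(t_min, t_max + 1)]


# Line t = 0 of Table 87: y' = 920, differences 60, -30, 2, and constant -1
print(build_polynomial(920, [60, -30, 2, -1], -4, 4))
```

Only additions and subtractions occur, which is the whole attraction of the method for hand or adding-machine work.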


Table 88. Aid for Computing Finite Differences at $t = 0$ in
Polynomial $y' = a + bt + ct^2 + dt^3 + \cdots$

    Parameter         (1)               (2)               (3)               (4)               (5)
                 $\Delta^1 y'_0$   $\Delta^2 y'_0$   $\Delta^3 y'_0$   $\Delta^4 y'_0$   $\Delta^5 y'_0$
        b              1
        c              1                 2
        d              1                 6                 6
        e              1                14                36                24
        f              1                30               150               240               120
This method is of general validity and can be used to find
values of a polynomial of any degree from knowledge of its
value for one year and its differences for that same year.
Fortunately, it is relatively easy to calculate the polynomial
value and the various differences for the year $t = 0$. If the
form of the polynomial is $y' = a + bt + ct^2 + dt^3 + et^4 + ft^5 + \cdots$,
then the polynomial value for $t = 0$ is $y' = a$. The first, second,
and higher order differences for $t = 0$ can be computed with the
help of Table 88.

The figures in Table 88 give the weights by which the parameters
$b$, $c$, $d$, $e$, $f$, etc., must be multiplied to give the difference
specified at the top of each column, as follows:

    $\Delta^1 y'_0 = b + c + d + e + f + \cdots$
    $\Delta^2 y'_0 = 2c + 6d + 14e + 30f + \cdots$
    $\Delta^3 y'_0 = 6d + 36e + 150f + \cdots$
    $\Delta^4 y'_0 = 24e + 240f + \cdots$
    $\Delta^5 y'_0 = 120f + \cdots$

For a particular polynomial, each of these equations, of course,
terminates with the coefficient of the highest power of $t$. Thus, for
a second-degree polynomial $y' = a + bt + ct^2$ the formulas for the
differences at $t = 0$ would be $\Delta^1 y'_0 = b + c$ and $\Delta^2 y'_0 = 2c$, the
higher differences being zero since the second difference is the same
for all values of $t$. For a third-degree polynomial,

    $y' = a + bt + ct^2 + dt^3$

the formulas would be $\Delta^1 y'_0 = b + c + d$, $\Delta^2 y'_0 = 2c + 6d$, and
$\Delta^3 y'_0 = 6d$. For a fourth-degree polynomial

    $y' = a + bt + ct^2 + dt^3 + et^4$

the differences would be $\Delta^1 y'_0 = b + c + d + e$,
$\Delta^2 y'_0 = 2c + 6d + 14e$, $\Delta^3 y'_0 = 6d + 36e$,
and $\Delta^4 y'_0 = 24e$.

For higher degree polynomials, the table can be readily
extended by the rule that a figure in a given line of a given column
is equal to the number of the column multiplied by the sum of
the two figures in the line above situated in the given column
and in the column to the left, respectively. For example,
36 = 3(6 + 6); 24 = 4(0 + 6); etc.¹
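The extension rule lends itself to a few lines of code. In this Python sketch, offered only as an illustration, the function name `difference_weights` is hypothetical; a blank cell of the table is treated as zero, as in the example 24 = 4(0 + 6).

```python
def difference_weights(degree):
    """Rows of Table 88: rows[p - 1][k - 1] is the weight of the coefficient
    of t**p in the k-th difference at t = 0, for p = 1 ... degree."""
    rows = [[1]]                               # line for b: first difference only
    for p in range(2, degree + 1):
        above = rows[-1] + [0]                 # blank cell to the right counts as 0
        row = []
        for k in range(1, p + 1):              # k is the number of the column
            left = above[k - 2] if k >= 2 else 0
            row.append(k * (above[k - 1] + left))
        rows.append(row)
    return rows


for name, row in zip("bcdefg", difference_weights(6)):
    print(name, row)
```

The sixth line printed, for the coefficient of $t^6$, goes one step beyond the printed table and follows from the same rule.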

The use of finite differences to compute the trend values of a
second-degree polynomial is illustrated in Table 89. The trend
is $y' = 434 + 2.923t + 6.2547t^2$ (origin at 1935), calculated
above in Table 86 for data on consumer expenditures for personal
appearance and comfort in the United States, 1929-1941.

In Table 89, the constant second difference is known to be
12.509; the first difference for $t = 0$ is 9.2,* and the trend value
for $t = 0$ is $y' = 434.0$. These are first entered in the work
sheet; then, since the constant second difference is positive, the
remainder of the column of first differences is obtained by
successively subtracting 12.509 to obtain first differences for earlier
years and by successively adding 12.509 to obtain first differences
for later years. Obtaining the trend values is illustrated as
follows: 434.0 + 9.2 = 443.2, 443.2 + 21.7 = 464.9, etc.; for
values before 1935, 434.0 - (-3.3) = 437.3, 437.3 - (-15.8)
= 453.1, etc. Beginning at the top of the table, it will then
be found that 641.5 - 65.9 = 575.6, 575.6 - 53.4 = 522.2, etc.

Table 89. Work Sheet for Computing Trend Values by Method of
Finite Differences

Equation of trend: $y' = 434 + 2.923t + 6.2547t^2$
Value of $y'_0 = 434$
Value of $\Delta^1 y'_0 = 2.923 + 6.2547 = 9.1777$
Constant $\Delta^2 y' = 12.509$

    Year      t     $\Delta^2 y'$    $\Delta^1 y'$      $y'$
    1929     -6        12.509           -65.9          641.5
    1930     -5                         -53.4          575.6
    1931     -4                         -40.8          522.2
    1932     -3                         -28.3          481.4
    1933     -2                         -15.8          453.1
    1934     -1                          -3.3          437.3
    1935      0                           9.2          434.0
    1936      1                          21.7          443.2
    1937      2                          34.2          464.9
    1938      3                          46.7          499.1
    1939      4                          59.2          545.8
    1940      5                          71.7          605.0
    1941      6                                        676.7

¹ Cf. Whittaker and Robinson, op. cit., pp. 1-7.

* The first differences may be rounded without causing cumulative error.

While the explanation of the method of finite differences may 
be extended, its use in solving second-degree polynomials for 
various values of t is much more expeditious than the method of 
obtaining solutions to the equation for the various values of t by 
substitution in the equation. The labor involved in the longer 
method is great if the number of years is large or if the poly- 
nomial is of higher than a second degree. In contrast, the 
method of finite differences may be used without difficulty, and 
the arithmetic involved is always simple addition or subtraction. 
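The procedure of Table 89 can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original work sheet: the function name `quadratic_trend_by_differences` is hypothetical, the range is assumed to include $t = 0$, and the differences are carried at full precision rather than rounded to one decimal.

```python
def quadratic_trend_by_differences(a, b, c, t_min, t_max):
    """Trend values of y' = a + b*t + c*t**2 for integer t, built up
    from t = 0 by addition and subtraction only (t_min <= 0 <= t_max)."""
    second = 2 * c          # constant second difference
    first_at_0 = b + c      # first difference at t = 0, i.e. y'(1) - y'(0)
    values = {0: a}

    # Later years: add the current first difference, then raise it.
    y, d1 = a, first_at_0
    for t in range(1, t_max + 1):
        y += d1
        values[t] = y
        d1 += second

    # Earlier years: lower the first difference, then subtract it.
    y, d1 = a, first_at_0
    for t in range(-1, t_min - 1, -1):
        d1 -= second
        y -= d1
        values[t] = y

    return dict(sorted(values.items()))


trend = quadratic_trend_by_differences(434.0, 2.923, 6.2547, -6, 6)
for t, y in trend.items():
    print(f"{1935 + t}  t={t:+d}  y'={y:.1f}")
```

Because the differences are not rounded here, a printed value may differ from the work sheet by a unit in the last place.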

A danger inheres in the use of finite differences, namely, that
any error in the higher order differences is cumulated as the lower
order differences are computed. For this reason, when the
trend line is determined the coefficients of the higher powers of
$t$ should be carried out to a larger number of places than would
be regarded as significant. If, for example, the coefficient of
$t^4$ is rounded off to the fifth place, the maximum error in the fourth
difference is $24 \times 0.000005 = 0.00012$, over a 7-year period.
If the other coefficients have also been rounded off to the last
place indicated, then the maximum error in

$$\Delta^3 y'_0 = 6(0.00005) + 36(0.000005) = 0.00048$$

in

$$\Delta^2 y'_0 = 2(0.0005) + 6(0.00005) + 14(0.000005) = 0.00137$$

in

$$\Delta y'_0 = 0.005 + 0.0005 + 0.00005 + 0.000005 = 0.005555$$

Table 90. Maximum Cumulated Errors in Differences and Polynomial Values
Error in $\Delta^4 y'$ = 0.000120

t    $\Delta^3 y'$   $\Delta^2 y'$   $\Delta y'$      $y'$
0      0.000480        0.001370       0.005555      0.050000
1      0.000600        0.001850       0.006925      0.055555
2      0.000720        0.002450       0.008775      0.062480
3      0.000840        0.003170       0.011225      0.071255
4                      0.004100       0.014395      0.082480
5                                     0.018495      0.096875
6                                                   0.115370
and in $y'_0 = 0.05$. Thus, by the time the seventh year, for
example, has been computed, the maximum error in that figure
becomes 0.11+. This is shown in Table 90.

The final error grows larger the further the work proceeds and 
thus makes it necessary to compute the coefficients of higher 
powers of t to several figures beyond the number of significant 
figures required in the computed trend values. A cross check 
on the method of finite differences would be to solve the poly- 
nomial equation for the terminal values of t. 

The danger of cumulative error is reduced to a minimum by
starting at $t = 0$ and accumulating upward through the $-t$'s
and accumulating downward through the $+t$'s.

Analysis of Cycles by Empirical Trends. Data on plate- 
glass production in the United States, 1933-1941, have been 
selected, in order to illustrate how cycles may be studied by 
empirical trend analysis. Table 91 is a work sheet providing the 
figures needed to compute either a straight-line trend or any 
polynomial trend up to the third degree. 

Table 91. Work Sheet for Computing Trend and Index of Normal
Method of least squares

Data: Production of plate glass, polished, in the United States
(In millions of square feet, monthly)

Source: Survey of Current Business, Supplement, Vol. 20 (1940), p. 151;
Vol. 21 (February, 1941), p. 99, Annual (March, 1942), p. S-35.

                            Sets of subtotals
Year     t      y        First     Second      Third
1933    -4     7.2        7.2        7.2         7.2
1934    -3     7.9       15.1       22.3        29.5
1935    -2    15.0       30.1       52.4        81.9
1936    -1    16.5       46.6       99.0       180.9
1937     0    16.0       62.6      161.6       342.5
1938     1     7.1       69.7      231.3       573.8
1939     2    11.8       81.5      312.8       886.6
1940     3    13.7       95.2      408.0     1,294.6
1941     4    15.9      111.1      519.1     1,813.7

Total        111.1      519.1    1,813.7     5,210.7
             $S_1$      $S_2$     $S_3$       $S_4$

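By using Eqs. (10), with $n = 5$, the values of $\Sigma t^2$, $\Sigma t^4$,
and $\Sigma t^6$ are found as follows:

$$\Sigma t^2 = \frac{n(n-1)(2n-1)}{3} = \frac{5(4)(9)}{3} = 60$$

$$\Sigma t^4 = \frac{\Sigma t^2}{5}\,(3n^2 - 3n - 1) = \frac{60}{5}\,(59) = 708$$

$$\Sigma t^6 = \frac{\Sigma t^2}{7}\left[3n^2(n^2 - 2n) + 3n + 1\right] = \frac{60}{7}\left[3(25)(25 - 10) + 15 + 1\right] = 60(163) = 9{,}780$$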

By using Eqs. (11), the values of $\Sigma ty$, $\Sigma t^2 y$, and
$\Sigma t^3 y$ are found as follows:

$$\Sigma y = S_1 = 111.1$$

$$\Sigma ty = nS_1 - S_2 = 5(111.1) - 519.1 = 36.4$$

$$\Sigma t^2 y = n^2 S_1 - (2n + 1)S_2 + 2S_3 = 25(111.1) - 11(519.1) + 2(1{,}813.7) = 694.8$$

$$\Sigma t^3 y = n^3 S_1 - (3n^2 + 3n + 1)S_2 + 6(n + 1)S_3 - 6S_4 = 125(111.1) - 91(519.1) + 36(1{,}813.7) - 6(5{,}210.7) = 678.4$$
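As a check on the work sheet, the subtotals and the product sums can be recomputed mechanically. The following Python sketch rebuilds the three sets of subtotals from the plate-glass figures and compares the formula results with direct summation; the variable names are illustrative only.

```python
from itertools import accumulate

# Plate-glass production, 1933-1941 (Table 91); origin t = 0 at 1937.
y = [7.2, 7.9, 15.0, 16.5, 16.0, 7.1, 11.8, 13.7, 15.9]
N = len(y)
n = (N + 1) // 2                      # n = 5 for 9 years

# Successive sets of subtotals (running sums of the preceding column).
first = list(accumulate(y))
second = list(accumulate(first))
third = list(accumulate(second))
S1, S2, S3, S4 = first[-1], second[-1], third[-1], sum(third)

# Product sums from the subtotal formulas quoted in the text.
sum_ty = n * S1 - S2
sum_t2y = n**2 * S1 - (2 * n + 1) * S2 + 2 * S3
sum_t3y = n**3 * S1 - (3 * n**2 + 3 * n + 1) * S2 + 6 * (n + 1) * S3 - 6 * S4

# The same sums obtained directly, for comparison.
half = (N - 1) // 2
ts = range(-half, half + 1)
direct = [sum(t**k * v for t, v in zip(ts, y)) for k in (1, 2, 3)]

print(round(sum_ty, 1), round(sum_t2y, 1), round(sum_t3y, 1))
print([round(v, 1) for v in direct])
```

The subtotal route uses addition almost exclusively, which is why it was preferred for desk calculation.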


From these, two trend lines may be computed, first a straight
line, and second a third-degree polynomial, as follows:

Straight-line trend:

$$y'_1 = \frac{111.1}{9} + \frac{36.4}{60}\,t$$

$$y'_1 = 12.3 + 0.607t \qquad \text{(origin at 1937)}$$


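Third-degree polynomial trend:
The normal equations are

$$\begin{aligned}
\Sigma y &= Na + b\Sigma t + c\Sigma t^2 + d\Sigma t^3 \\
\Sigma ty &= a\Sigma t + b\Sigma t^2 + c\Sigma t^3 + d\Sigma t^4 \\
\Sigma t^2 y &= a\Sigma t^2 + b\Sigma t^3 + c\Sigma t^4 + d\Sigma t^5 \\
\Sigma t^3 y &= a\Sigma t^3 + b\Sigma t^4 + c\Sigma t^5 + d\Sigma t^6
\end{aligned}$$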

in which all the sums of the odd powers of $t$ are equal to zero,
so that the equations for finding $a$, $b$, $c$, $d$ are as follows:

(i)                111.1 = 9a + 60c
(ii)                36.4 = 60b + 708d
(iii)              694.8 = 60a + 708c
(iv)               678.4 = 708b + 9,780d

(ii')             429.52 = 708b + 8,354.4d       [(ii) × 11.8]
(iv) - (ii')      248.88 = 1,425.6d
                       d = 0.17457

(iii)              694.8 = 60a + 708c
(i')            1,310.98 = 106.2a + 708c         [(i) × 11.8]
(i') - (iii)      616.18 = 46.2a
                       a = 13.34

Substituting $d$ in Eq. (ii), $b = -1.45325$

Substituting $a$ in Eq. (i), $c = -0.14891$

The third-degree polynomial trend equation is thus

$$y'_3 = 13.34 - 1.45325t - 0.14891t^2 + 0.17457t^3 \qquad \text{(origin at 1937)}$$
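Since the odd power sums vanish, the four normal equations separate into two pairs, one in $a$ and $c$ and one in $b$ and $d$. A minimal Python sketch of that solution follows; the helper `solve_2x2` is a hypothetical name, and Cramer's rule is used here purely for brevity in place of the elimination shown above.

```python
def solve_2x2(a11, a12, b1, a21, a22, b2):
    """Solve a11*x + a12*y = b1, a21*x + a22*y = b2 by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - b1 * a21) / det


N, sum_t2, sum_t4, sum_t6 = 9, 60, 708, 9780
sum_y, sum_ty, sum_t2y, sum_t3y = 111.1, 36.4, 694.8, 678.4

# Equations (i) and (iii) involve only a and c.
a, c = solve_2x2(N, sum_t2, sum_y, sum_t2, sum_t4, sum_t2y)
# Equations (ii) and (iv) involve only b and d.
b, d = solve_2x2(sum_t2, sum_t4, sum_ty, sum_t4, sum_t6, sum_t3y)

print(f"a={a:.5f}  b={b:.5f}  c={c:.5f}  d={d:.5f}")
```

Carried at full precision, $b$ differs from the hand value in the fourth decimal, because the hand computation substitutes the rounded $d$ into Eq. (ii).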

By using the method of finite differences to solve for various
trend values, from Table 88 above, at $t = 0$,

$$y'_3 = 13.34$$

and

$$\Delta y'_3 = -1.45325 - 0.14891 + 0.17457 = -1.42759$$

$$\Delta^2 y'_3 = 2(-0.14891) + 6(0.17457) = 0.7496$$

$$\Delta^3 y'_3 = 6(0.17457) = 1.04742,$$

which is a constant difference in this case.

In Table 92, trend values are built up for the problem by using
the method of finite differences. First, opposite $t = 0$, the
value of $y'$ and the first, second, and third differences are entered.
The constant third difference, 1.04742, is then subtracted
successively in the $-t$ direction (upward in the table); it is then
added successively in the $+t$ direction (downward in the table).
For example, starting at $t = 0$, the second difference is 0.74960;
the second difference at $t = -1$ is

$$0.74960 - 1.04742 = -0.29782$$

the second difference at $t = -2$ is

$$-0.29782 - 1.04742 = -1.34524; \text{ etc.}$$

Starting again at $t = 0$, the second difference is again 0.74960;
the second difference at $t = +1$ is

$$0.74960 + 1.04742 = 1.79702; \text{ etc.}$$


Fig. 146. Production of plate glass, polished, in the United States, 1933-1941.
Straight-line and third-degree polynomial trends shown with raw data.

The column of first differences is built up from the column of
second differences. For example, starting at $t = 0$, the first
difference is -1.42579; the first difference for $t = -1$ is then
$-1.42579 - (-0.29782) = -1.12797$, the first difference for
$t = -2$ is $-1.12797 - (-1.34524) = +0.21727$; etc. Again,

Table 92. Work Sheet for Finding Trend Values by Method of
Finite Differences

Equation of trend: $y'_3 = 13.34 - 1.45325t - 0.14891t^2 + 0.17457t^3$
(origin at 1937)

Year     t    $\Delta^3 y'$   $\Delta^2 y'$   $\Delta y'$     $y'$
1933    -4                      -3.44008        6.05001
1934    -3                      -2.39266        2.60993
1935    -2                      -1.34524        0.21727
1936    -1                      -0.29782       -1.12797
1937     0      1.04742          0.74960       -1.42579      13.34
1938     1                       1.79702       -0.67619
1939     2                       2.84444        1.12083
1940     3                                      3.96527
1941     4
starting at $t = 0$ with the first difference -1.42579, the first
difference at $t = +1$ is $-1.42579 + 0.74960 = -0.67619$; the
first difference at $t = +2$ is $-0.67619 + 1.79702 = 1.12083$; etc.
The values of $y'$ are found from the first differences in exactly the
same manner as the first differences from the second differences.
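The build-up of a third-degree work sheet can be sketched as follows. This is an illustrative Python version, not the authors' procedure verbatim: the function name `cubic_difference_table` is hypothetical, differences are taken as forward differences (the first difference at $t$ is $y'(t+1) - y'(t)$), and nothing is rounded along the way.

```python
def cubic_difference_table(a, b, c, d, t_min, t_max):
    """Rows (t, second difference, first difference, y') for
    y' = a + b*t + c*t**2 + d*t**3, built outward from t = 0."""
    d3 = 6 * d                          # constant third difference
    d2 = {0: 2 * c + 6 * d}             # second difference at t = 0
    d1 = {0: b + c + d}                 # first difference at t = 0
    y = {0: a}

    for t in range(1, t_max + 1):       # downward in the table (+t)
        d2[t] = d2[t - 1] + d3
        d1[t] = d1[t - 1] + d2[t - 1]
        y[t] = y[t - 1] + d1[t - 1]

    for t in range(-1, t_min - 1, -1):  # upward in the table (-t)
        d2[t] = d2[t + 1] - d3
        d1[t] = d1[t + 1] - d2[t]
        y[t] = y[t + 1] - d1[t]

    return [(t, d2[t], d1[t], y[t]) for t in range(t_min, t_max + 1)]


rows = cubic_difference_table(13.34, -1.45325, -0.14891, 0.17457, -4, 4)
for t, second, first, value in rows:
    print(f"{1937 + t}  {t:+d}  {second:9.5f}  {first:9.5f}  {value:8.3f}")
```

Each trend value produced this way agrees with direct substitution in the equation, which is the cross check recommended earlier for the terminal values of $t$.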

The results of the trend analysis are shown graphically in
Fig. 146. If it can be assumed that the 9 years covered by the
data are a segment in a longer cyclical movement,
the straight-line trend may be considered to measure a part of
that longer cycle (part or all of its upward movement). The
shorter cycle is then shown by the polynomial trend. Plate-glass
production appears to have gone through one complete
short cycle from about 1934 to about mid-1940.



CHAPTER XXII 

ORTHOGONAL-POLYNOMIAL TRENDS 


Great economy in trend analysis is secured by the use of
orthogonal polynomials, especially if the trend desired is of
higher degree than a second-degree polynomial. It requires
considerable space to explain and describe the method of orthogonal
polynomials, which may seem to belie the fact of its economy in
use, but the actual arithmetic of application is simple. When
lines of regression involving more than three coefficients are
fitted to time series by the least-squares criterion, the work of
computation by the ordinary method increases very rapidly.
Laborsaving devices introduced in the preceding chapter, including
the use of the summation work sheet and the determination of
$\Sigma t^2$, $\Sigma t^4$, etc., by formula, help to keep the amount of
calculation at a minimum; but further reduction in the amount of
calculation and particularly in the magnitude of the figures that
have to be handled is obtained by using orthogonal polynomials.

A "polynomial" is an algebraic expression of the form

$$a + bt + ct^2$$

which, for example, is a polynomial in $t$ of the second degree. A
polynomial in $t$ of the fourth degree would be

$$a + bt + ct^2 + dt^3 + et^4$$

and so forth. "Orthogonal" polynomials are polynomials that
bear a certain relationship to each other, to be described below.
The use of orthogonal polynomials involves merely a special
method of computing the coefficients of a trend line; the method
of fitting is still the method of least squares.

One of the greatest advantages of using orthogonal-polynomial
trends is that, if the investigator decides to fit either a higher or
lower degree trend line than what he has already derived, the
amount of work involved in these further calculations is reduced
to a minimum. In fact, no extra work at all would be required
to determine the equation for a trend line of lower degree, while
the determination of an equation for a trend line of higher degree
would require only the calculation of quantities pertaining
directly to the added term and would not necessitate any
recalculations of other quantities. The work already done will
therefore not be wasted.

Orthogonal Polynomials. Suppose a variable $t$ has a set of
values, say from 0 to 3. If each of these values is substituted in
a polynomial in $t$, the polynomial will take on a corresponding
set of values. Thus, if $p_1 = t - 1.5$ is a given polynomial in $t$,
then, as $t$ has the values 0, 1, 2, and 3, $p_1$ has values -1.5,
-0.5, +0.5, and +1.5. Another polynomial in $t$, say

$$p_2 = t^2 - 3t + 1$$

will have a different set of values; in this instance, it will have
the values 1, -1, -1, and 1 when $t$ has the values 0, 1, 2, and 3,
respectively.

Orthogonal polynomials are those that bear special relationships
to each other. The necessary condition for two polynomials
to be orthogonal to each other is that the sum of their
product for all values of $t$ shall be equal to zero. That this
necessary condition is met by $p_1 = t - 1.5$ and $p_2 = t^2 - 3t + 1$
is readily seen. Thus, when $t = 0$,

$$p_1 p_2 = (t - 1.5)(t^2 - 3t + 1) = -1.5$$

when $t = 1$, $p_1 p_2 = +0.5$; when $t = 2$, $p_1 p_2 = -0.5$; and
when $t = 3$, $p_1 p_2 = +1.5$. Hence,

$$\Sigma p_1 p_2 = -1.5 + 0.5 - 0.5 + 1.5 = 0$$

The polynomials $p_1 = t - 1.5$ and $p_2 = t^2 - 3t + 1$, accordingly,
possess the orthogonal property.

In general, if a set of polynomials in $t$, say $p_1, p_2, p_3, \ldots, p_r$,
form an orthogonal set, then it is necessary that

$$\left.\begin{array}{llll}
\Sigma p_1 p_2 = 0 & \Sigma p_1 p_3 = 0 & \cdots & \Sigma p_1 p_r = 0 \\
\Sigma p_2 p_3 = 0 & \Sigma p_2 p_4 = 0 & \cdots & \Sigma p_2 p_r = 0 \\
\Sigma p_3 p_4 = 0 & \Sigma p_3 p_5 = 0 & \cdots & \Sigma p_3 p_r = 0
\end{array}\right\} \tag{1}$$

These are the general conditions that must be satisfied by
orthogonal polynomials. Notice that they are equivalent to the
conditions that the correlation between each pair of polynomials
is zero.
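The numerical illustration can be verified directly. The short Python sketch below tabulates the two polynomials of the example for $t = 0, 1, 2, 3$ and sums their products; the variable names are illustrative.

```python
ts = [0, 1, 2, 3]
p1 = [t - 1.5 for t in ts]              # -1.5, -0.5, 0.5, 1.5
p2 = [t**2 - 3 * t + 1 for t in ts]     # 1, -1, -1, 1

products = [u * v for u, v in zip(p1, p2)]
print(products, sum(products))

# Each polynomial is also orthogonal to unity: its values sum to zero.
print(sum(p1), sum(p2))
```

The sum of products vanishes, and so do the separate sums of $p_1$ and $p_2$, which is the condition of orthogonality to unity used in the sections that follow.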

Trend Line in Orthogonal Polynomials by the Method of Least
Squares. The form of a trend-line equation that has so far been
used is $y' = a + bt + ct^2 + dt^3 + \cdots$. This is an arbitrary
form, however, and it is to be noted that other forms of the
identical equation are possible. This can be illustrated
numerically as follows:

The equation $y' = 105.3 + 8.1t - 0.7t^2$ is identically the same
as $y' = 115 + 6(t - 1.5) - 0.7(t^2 - 3t + 1)$, which may be
proved by multiplying out the expressions in the latter equation
and collecting like terms. If the use of the second form has any
advantage over the use of the first, there is no reason why it
may not be adopted.
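The multiplication referred to above may be written out in full as a check:

```latex
\begin{aligned}
y' &= 115 + 6(t - 1.5) - 0.7(t^2 - 3t + 1) \\
   &= 115 + 6t - 9 - 0.7t^2 + 2.1t - 0.7 \\
   &= (115 - 9 - 0.7) + (6 + 2.1)t - 0.7t^2 \\
   &= 105.3 + 8.1t - 0.7t^2
\end{aligned}
```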

Suppose, now, that instead of fitting a trend line in the form
$y' = a + bt + ct^2 + dt^3 + \cdots$, the fitting process is carried out with
respect to the form

$$y' = A + Bp_1 + Cp_2 + Dp_3 + Ep_4$$

in which $p_1$, $p_2$, $p_3$, and $p_4$ are polynomials in $t$ of the first, second,
third, and fourth degree, respectively, that are orthogonal to
each other and to unity, that is to say, where $p_1$ is a polynomial
in $t$ of the form $p_1 = k_{10} + t$, $p_2$ is a polynomial in $t$ of the form
$p_2 = k_{20} + k_{21}t + t^2$, $p_3$ is a polynomial in $t$ of the form

$$p_3 = k_{30} + k_{31}t + k_{32}t^2 + t^3, \text{ etc.}$$

and where $\Sigma p_1 = 0$, $\Sigma p_2 = 0$, $\Sigma p_3 = 0$, $\Sigma p_4 = 0$, and
$\Sigma p_1 p_2 = 0$, $\Sigma p_1 p_3 = 0$, $\Sigma p_1 p_4 = 0$, $\Sigma p_2 p_3 = 0$, etc. With
reference to the arithmetical illustration given above, which was a
second-degree polynomial, this is equivalent to deriving a trend line
of the form

$$y' = 115 + 6(t - 1.5) - 0.7(t^2 - 3t + 1)$$

instead of the usual form $y' = 105.3 + 8.1t - 0.7t^2$.

Either method will, of course, give the same result; for,
whichever form is derived, it can be converted into the other by
simple algebra. It is the purpose of this section to show the
simplification gained by using the orthogonal-polynomial form
rather than the usual form. The problem of finding the forms
of the polynomials themselves, i.e., the values of the $k$ coefficients,
will be left for a subsequent section.
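If a trend line is put in the orthogonal-polynomial form

$$y' = A + Bp_1 + Cp_2 + Dp_3$$

and is then fitted by the method of least squares, i.e., if $A$, $B$, $C$,
and $D$ are determined so that

$$\Sigma(y - y')^2 = \Sigma(y - A - Bp_1 - Cp_2 - Dp_3)^2$$

is made a minimum, the following conditions are obtained:

$$\begin{aligned}
\Sigma(y - A - Bp_1 - Cp_2 - Dp_3) &= 0 \\
\Sigma p_1(y - A - Bp_1 - Cp_2 - Dp_3) &= 0 \\
\Sigma p_2(y - A - Bp_1 - Cp_2 - Dp_3) &= 0 \\
\Sigma p_3(y - A - Bp_1 - Cp_2 - Dp_3) &= 0
\end{aligned}$$

or

$$\begin{aligned}
\Sigma y &= NA + B\Sigma p_1 + C\Sigma p_2 + D\Sigma p_3 \\
\Sigma p_1 y &= A\Sigma p_1 + B\Sigma p_1^2 + C\Sigma p_1 p_2 + D\Sigma p_1 p_3 \\
\Sigma p_2 y &= A\Sigma p_2 + B\Sigma p_1 p_2 + C\Sigma p_2^2 + D\Sigma p_2 p_3 \\
\Sigma p_3 y &= A\Sigma p_3 + B\Sigma p_1 p_3 + C\Sigma p_2 p_3 + D\Sigma p_3^2
\end{aligned}$$

But since 1, $p_1$, $p_2$, and $p_3$ form an orthogonal set (by
assumption), it follows that $\Sigma p_1 = 0$, $\Sigma p_2 = 0$, $\Sigma p_3 = 0$,
$\Sigma p_1 p_2 = 0$, $\Sigma p_1 p_3 = 0$, and $\Sigma p_2 p_3 = 0$. Hence the above
equations reduce to

$$\Sigma y = NA, \qquad \Sigma p_1 y = B\Sigma p_1^2, \qquad \Sigma p_2 y = C\Sigma p_2^2, \qquad \Sigma p_3 y = D\Sigma p_3^2$$

and therefore

$$A = \frac{\Sigma y}{N}, \qquad B = \frac{\Sigma p_1 y}{\Sigma p_1^2}, \qquad C = \frac{\Sigma p_2 y}{\Sigma p_2^2}, \qquad D = \frac{\Sigma p_3 y}{\Sigma p_3^2} \tag{2}$$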

The simple form of these solutions will be noted. It will also
be noted that the solution for $A$ is independent of $p_1$, $p_2$, and
$p_3$ and that the solution for $B$ depends only upon $p_1$, the solution
for $C$ only upon $p_2$, and the solution for $D$ only upon $p_3$. This
means that the value of $A$ would have been the same whether a
first-, second-, or third-degree trend line had been fitted.
Similarly, the value of $B$ would have been the same whether a first-,
second-, or third-degree trend had been fitted, and the value of $C$
would have been the same whether a second- or third-degree
trend line had been fitted. For if $y' = A + Bp_1$ had been fitted,
the solutions would still have been

$$A = \frac{\Sigma y}{N} \quad \text{and} \quad B = \frac{\Sigma p_1 y}{\Sigma p_1^2}$$

If $y' = A + Bp_1 + Cp_2$ had been fitted, the solutions would
still have been

$$A = \frac{\Sigma y}{N}, \qquad B = \frac{\Sigma p_1 y}{\Sigma p_1^2}, \quad \text{and} \quad C = \frac{\Sigma p_2 y}{\Sigma p_2^2}$$

The addition of the term $Cp_2$ does not therefore change the
values obtained for $A$ or $B$, and the addition of the term $Dp_3$
does not change the values obtained for $A$, $B$, or $C$. It also
can be seen that if a fifth term were added to the trend line,
namely, $Ep_4$, making it a fourth-degree trend, the value of $E$
would be given by $E = \Sigma p_4 y / \Sigma p_4^2$ and the values of $A$, $B$, $C$,
and $D$ would be the same as before. It is this simplicity and
independence of the solutions of the least-squares equations
when orthogonal polynomials are used that give the
orthogonal-polynomial method its main advantage over the ordinary
method.

Forms of Orthogonal Polynomials Used. The forms of the
orthogonal polynomials to be used for fitting trends can be
generalized; what is required is to find the $k$'s in terms of the
given values of $t$ and the number of years involved. The
condition has been laid down that $p_1 = k_{10} + t$, $p_2 = k_{20} + k_{21}t + t^2$,
and $p_3 = k_{30} + k_{31}t + k_{32}t^2 + t^3$, etc., are to be polynomials of
the first, second, and third degree in $t$, respectively, that are
orthogonal to each other and to unity. The problem is to make
use of this condition to determine values for the $k$'s in terms of
the given values of $t$. When this is done, it will be possible to
find the actual values of $A$, $B$, $C$, and $D$, from the formulas of the
preceding section.

By way of illustration, the forms of only $p_1$ and $p_2$ will be
derived; the method can be readily extended to the determination
of the forms of $p_3$ and of higher polynomials.

First, it is assumed that the time intervals $T$ are measured
from the mean $\bar{T}$, so that $p_1$ and $p_2$ become $p_1 = k_{10} + t$ and
$p_2 = k_{20} + k_{21}t + t^2$, where $t = T - \bar{T}$. In addition, it is
supposed that the time intervals to which the variable refers are
equally spaced and without interruptions. According to these
assumptions, $t$ will have a mean of zero; its highest value will
be $+\frac{N-1}{2}$ and its lowest value $-\frac{N-1}{2}$.* For example, if
there are 5 years of data, the middle year will be 0, the first
year -2, and the last year +2. If there are 4 years of data, the
first year will be $-\frac{3}{2}$, the second year $-\frac{1}{2}$, the third year
$+\frac{1}{2}$, and the last $+\frac{3}{2}$.

Accordingly, all the odd moments of $t$, such as $\Sigma t/N$, $\Sigma t^3/N$,
and $\Sigma t^5/N$, will be zero; the even moments, such as $\Sigma t^2/N$,
$\Sigma t^4/N$, and $\Sigma t^6/N$, are computable from simple formulas
depending entirely on $N$, the number of years, as already noted in
Chap. XXI.¹
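The centering convention and the behavior of the moments can be illustrated with a small Python sketch. The function name `centered_times` is hypothetical, and exact fractions are used so that the half-year values for an even number of years are represented without rounding.

```python
from fractions import Fraction


def centered_times(N):
    """Equally spaced time values measured from their mean, for N periods."""
    half = Fraction(N - 1, 2)
    return [Fraction(i) - half for i in range(N)]


print(centered_times(5))   # -2, -1, 0, 1, 2
print(centered_times(4))   # -3/2, -1/2, 1/2, 3/2

for N in (4, 5, 9):
    ts = centered_times(N)
    odd_sums = [sum(t**k for t in ts) for k in (1, 3, 5)]
    sum_t2 = sum(t**2 for t in ts)
    # Odd power sums vanish; the sum of squares depends on N alone.
    print(N, odd_sums, sum_t2, Fraction(N * (N**2 - 1), 12))
```

For $N = 9$ the sum of squares is 60, the figure used in the plate-glass problem of the preceding chapter.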

With these assumptions, the derivation of the form of the
orthogonal polynomials, that is to say, the derivation of the
values of $k_{10}$, $k_{20}$, and $k_{21}$, may now be undertaken. The
condition that $p_1$, $p_2$, and 1 shall be orthogonal to each other requires
that $\Sigma p_1 = 0$, $\Sigma p_2 = 0$, and $\Sigma p_1 p_2 = 0$. These equations may
be written as follows:

$$\Sigma p_1 = \Sigma(k_{10} + t) = Nk_{10} + \Sigma t = 0 \tag{i}$$

$$\Sigma p_2 = \Sigma(k_{20} + k_{21}t + t^2) = Nk_{20} + k_{21}\Sigma t + \Sigma t^2 = 0 \tag{ii}$$

$$\Sigma p_1 p_2 = \Sigma p_2(k_{10} + t) = k_{10}\Sigma p_2 + \Sigma p_2 t = 0 \tag{iii}$$

From these equations, the values of the $k$'s can readily be
obtained. Since $\Sigma t = 0$, (i) gives $Nk_{10} = 0$, or $k_{10} = 0$; and
Eq. (ii) gives $Nk_{20} + \Sigma t^2 = 0$, or $k_{20} = -\Sigma t^2/N$. From Eq. (ii),
it is known that $\Sigma p_2 = 0$; hence, Eq. (iii) becomes $\Sigma p_2 t = 0$.
Substituting the equivalent of $p_2$, this gives the condition,

$$\Sigma p_2 t = \Sigma(k_{20} + k_{21}t + t^2)t = k_{20}\Sigma t + k_{21}\Sigma t^2 + \Sigma t^3 = 0 \tag{iv}$$

* These assumptions were made in the preceding chapter. Cf. Tables
84 to 86, Chap. XXI.

¹ Cf. pp. 684, 686.
Since both $\Sigma t$ and $\Sigma t^3$ are equal to zero, this becomes

$$k_{21}\Sigma t^2 = 0$$

and hence $k_{21}$ must be zero, since $\Sigma t^2$ is not. The values of the
$k$'s, therefore, are as follows:

$$k_{10} = 0, \qquad k_{21} = 0, \qquad k_{20} = -\frac{\Sigma t^2}{N}$$

and the forms of the polynomials $p_1$, $p_2$ are therefore

$$p_1 = t, \qquad p_2 = t^2 - \frac{\Sigma t^2}{N}$$

in which

$$\Sigma t^2 = \frac{n(n-1)(2n-1)}{3}, \qquad n = \frac{N+1}{2}$$

(see footnote *). Hence,

$$\frac{\Sigma t^2}{N} = \frac{N^2 - 1}{12}$$

Accordingly,

$$p_2 = t^2 - \frac{N^2 - 1}{12}$$

Similar methods of analysis may be used to derive the forms
of $p_3$ and higher polynomials. The results obtained for
polynomials up to the fifth degree may be listed as follows:¹

$$\begin{aligned}
p_1 &= t \\
p_2 &= t^2 - \frac{N^2 - 1}{12} \\
p_3 &= t^3 - \frac{3N^2 - 7}{20}\,t \\
p_4 &= t^4 - \frac{3N^2 - 13}{14}\,t^2 + \frac{3(N^2 - 1)(N^2 - 9)}{560} \\
p_5 &= t^5 - \frac{5(N^2 - 7)}{18}\,t^3 + \frac{15N^4 - 230N^2 + 407}{1{,}008}\,t
\end{aligned} \tag{5}$$

* Cf. Eqs. (10), Chap. XXI.

¹ Cf. Fisher, R. A., Statistical Methods for Research Workers, Section 27.
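That polynomials of these forms are orthogonal to one another and to unity can be confirmed numerically for any particular $N$. The Python sketch below does so for $N = 9$ with exact fractions; the function names are illustrative, and the closed form used for the sums of squares, $\Sigma p_r^2 = \frac{(r!)^4}{(2r)!\,(2r+1)!}\,N(N^2-1)\cdots(N^2-r^2)$, is the standard result for these polynomials, stated here as an assumption rather than taken from the text.

```python
from fractions import Fraction
from math import factorial


def orthogonal_polynomials(N, t):
    """Values of 1, p1, ..., p5 at time t (measured from the mean)."""
    N2 = N * N
    p1 = t
    p2 = t**2 - Fraction(N2 - 1, 12)
    p3 = t**3 - Fraction(3 * N2 - 7, 20) * t
    p4 = (t**4 - Fraction(3 * N2 - 13, 14) * t**2
          + Fraction(3 * (N2 - 1) * (N2 - 9), 560))
    p5 = (t**5 - Fraction(5 * (N2 - 7), 18) * t**3
          + Fraction(15 * N2**2 - 230 * N2 + 407, 1008) * t)
    return [Fraction(1), p1, p2, p3, p4, p5]


def sum_of_squares_formula(N, r):
    """Closed form for the sum of p_r squared over the N time values."""
    product = Fraction(N)
    for k in range(1, r + 1):
        product *= N * N - k * k
    return product * factorial(r) ** 4 / (factorial(2 * r) * factorial(2 * r + 1))


N = 9
ts = [Fraction(2 * i - (N - 1), 2) for i in range(N)]
table = [orthogonal_polynomials(N, t) for t in ts]


def cross_sum(i, j):
    return sum(row[i] * row[j] for row in table)


for i in range(6):
    for j in range(i + 1, 6):
        assert cross_sum(i, j) == 0          # every pair is orthogonal
for r in range(1, 6):
    print(r, cross_sum(r, r), sum_of_squares_formula(N, r))
```

For $N = 9$ the sums of squares of $p_1$, $p_2$, and $p_3$ come out as 60, 308, and 1,425.6.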
Thus, it is to be noted that a trend line can be fitted in two
different forms, by the method of least squares; it can be fitted
in the form $y' = a + bt + ct^2 + \cdots$ (where $t = T - \bar{T}$)
by the methods described in the preceding chapter, or it can be
fitted in the form $y' = A + Bp_1 + Cp_2 + Dp_3 + \cdots$ by the
method of orthogonal polynomials. If the orthogonal-polynomial
form is used in the fitting process, the ordinary form of the
trend equation can readily be derived from the results; it should
be repeated that the criterion of fit in each case is the
least-squares criterion.
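The passage from the orthogonal form to the ordinary form can be sketched in Python with the plate-glass figures of the preceding chapter. The sketch is illustrative only: the names are hypothetical, and the coefficients are obtained from $\Sigma p_k y / \Sigma p_k^2$ by direct summation rather than by any work-sheet shortcut.

```python
# Plate-glass production, 1933-1941; t is measured from 1937.
y = [7.2, 7.9, 15.0, 16.5, 16.0, 7.1, 11.8, 13.7, 15.9]
N = len(y)
ts = [i - (N - 1) / 2 for i in range(N)]

k2 = (N**2 - 1) / 12            # p2 = t**2 - k2
k3 = (3 * N**2 - 7) / 20        # p3 = t**3 - k3 * t
p1 = ts
p2 = [t**2 - k2 for t in ts]
p3 = [t**3 - k3 * t for t in ts]


def coefficient(p):
    """Least-squares coefficient of an orthogonal term: sum(p*y) / sum(p*p)."""
    return sum(u * v for u, v in zip(p, y)) / sum(u * u for u in p)


A = sum(y) / N
B, C, D = coefficient(p1), coefficient(p2), coefficient(p3)


def ordinary_coefficients(degree):
    """(a, b, c, d) of the ordinary form for a trend of the given degree,
    reusing the same A, B, C, D and dropping the higher terms."""
    Ck = C if degree >= 2 else 0.0
    Dk = D if degree >= 3 else 0.0
    return (A - Ck * k2, B - Dk * k3, Ck, Dk)


print(ordinary_coefficients(1))   # straight line
print(ordinary_coefficients(3))   # third-degree polynomial
```

The same $A$ and $B$ serve both trends, whereas the ordinary coefficient of $t$ changes from about 0.607 in the straight line to about -1.453 in the third-degree polynomial.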

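Calculation of the Coefficients A, B, C, .... If the values of
$p_1$, $p_2$, $p_3$, ... given in Eq. (5) are substituted in formulas
for $A$, $B$, $C$, etc. [Eqs. (2)], the following values are
obtained:

$$\begin{aligned}
A &= \frac{\Sigma y}{N} \\[4pt]
B &= \frac{\Sigma ty}{\dfrac{N(N^2 - 1)}{12}} \\[4pt]
C &= \frac{\Sigma t^2 y - \dfrac{N^2 - 1}{12}\,\Sigma y}{\dfrac{N(N^2 - 1)(N^2 - 4)}{180}} \\[4pt]
D &= \frac{\Sigma t^3 y - \dfrac{3N^2 - 7}{20}\,\Sigma ty}{\dfrac{N(N^2 - 1)(N^2 - 4)(N^2 - 9)}{2{,}800}} \\[4pt]
E &= \frac{\Sigma t^4 y - \dfrac{3N^2 - 13}{14}\,\Sigma t^2 y + \dfrac{3(N^2 - 1)(N^2 - 9)}{560}\,\Sigma y}{\dfrac{N(N^2 - 1)(N^2 - 4)(N^2 - 9)(N^2 - 16)}{44{,}100}} \\[4pt]
F &= \frac{\Sigma t^5 y - \dfrac{5(N^2 - 7)}{18}\,\Sigma t^3 y + \dfrac{15N^4 - 230N^2 + 407}{1{,}008}\,\Sigma ty}{\dfrac{N(N^2 - 1)(N^2 - 4)(N^2 - 9)(N^2 - 16)(N^2 - 25)}{698{,}544}}
\end{aligned} \tag{6}$$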

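In order to illustrate the algebraic procedure by which the
above formulas are obtained, the formula for $C$ will be derived,
as follows:

$$C = \frac{\Sigma p_2 y}{\Sigma p_2^2}$$

in which the numerator is $\Sigma p_2 y = \Sigma t^2 y - \frac{N^2 - 1}{12}\Sigma y$ and the
denominator is

$$\Sigma p_2^2 = \Sigma\left(t^2 - \frac{N^2 - 1}{12}\right)^2 = \Sigma t^4 - \frac{2(N^2 - 1)}{12}\,\Sigma t^2 + \frac{N(N^2 - 1)^2}{144}$$

The formula for $\Sigma t^2/N$, however, is $\frac{N^2 - 1}{12}$, and hence

$$\Sigma t^2 = \frac{N(N^2 - 1)}{12}$$

Likewise, the formula for $\Sigma t^4/N$ is $\frac{N^2 - 1}{12} \cdot \frac{3N^2 - 7}{20}$, and hence

$$\Sigma t^4 = \frac{N(N^2 - 1)}{12} \cdot \frac{3N^2 - 7}{20}$$

Therefore the denominator of $C$ becomes

$$\frac{N(N^2 - 1)}{12} \cdot \frac{3N^2 - 7}{20} - \frac{2(N^2 - 1)}{12} \cdot \frac{N(N^2 - 1)}{12} + \frac{N(N^2 - 1)^2}{144}$$

Taking $N(N^2 - 1)/12$ out of each term,

$$\frac{N(N^2 - 1)}{12}\left[\frac{3N^2 - 7}{20} - \frac{2(N^2 - 1)}{12} + \frac{N^2 - 1}{12}\right]$$

which readily reduces to

$$\frac{N(N^2 - 1)}{12} \cdot \frac{N^2 - 4}{15} = \frac{N(N^2 - 1)(N^2 - 4)}{180}$$

Thus $C$ has the formula given above. The formulas for the
other coefficients can be obtained in the same way.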
Equations (6) could be applied by using a work sheet with
columns for the product terms indicated in order to obtain
$\Sigma ty$, $\Sigma t^2 y$, etc. Greater economy is obtained, however, by
using the subtotal summation type of work sheet illustrated in
the preceding chapter. By using such a work sheet, an
expeditious method that involves only addition and is self-checking
has been evolved for finding $A$, $B$, $C$, .... A brief description
of this method, together with the mathematical analysis that
justifies its use, will now be given.

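$\alpha$ is defined as $S_1/N$, so that

$$N\alpha = S_1 \tag{7}$$

and $\alpha'$ is defined as equal to $\alpha$. Accordingly,

$$A = \alpha' = \alpha = \frac{\Sigma y}{N} = \frac{S_1}{N} \tag{i}$$

From Eqs. (14), Chap. XXI,

$$S_2 = \frac{N + 1}{2}\,\Sigma y - \Sigma ty$$

and, by the definition of $\alpha$,

$$S_2 = \frac{N(N + 1)}{2}\,\alpha - \Sigma ty$$

If $\beta$ is now defined as

$$\beta = \frac{2S_2}{N(N + 1)} \tag{ii}$$

then

$$\beta = \alpha - \frac{2}{N(N + 1)}\,\Sigma ty$$

But $\Sigma ty = \Sigma p_1 y$, since $p_1 = t$; and if $\beta'$ is defined as
$\beta' = 2\Sigma p_1 y / N(N + 1)$,

$$\beta = \alpha - \beta'$$

and

$$\beta' = \alpha - \beta \tag{iii}$$

Since $B = \dfrac{\Sigma p_1 y}{N(N^2 - 1)/12}$, it follows that

$$B = \frac{6}{N - 1}\,\beta' \tag{iv}$$

Again, from Eqs. (14), Chap. XXI, it is found that

$$2S_3 = \frac{(N + 1)(N + 3)}{4}\,\Sigma y - (N + 2)\,\Sigma ty + \Sigma t^2 y$$

in which $\alpha$ and $\beta$ may be substituted for equivalents, so that

$$2S_3 = \frac{N(N + 1)(N + 3)}{4}\,\alpha - \frac{N(N + 1)(N + 2)}{2}\,(\alpha - \beta) + \Sigma t^2 y
= -\frac{N(N + 1)^2}{4}\,\alpha + \frac{N(N + 1)(N + 2)}{2}\,\beta + \Sigma t^2 y$$

Since $p_2 = t^2 - \frac{N^2 - 1}{12}$, it follows that
$\Sigma p_2 y = \Sigma t^2 y - \frac{N^2 - 1}{12}\,\Sigma y$. Hence

$$\Sigma t^2 y = \Sigma p_2 y + \frac{N(N + 1)(N - 1)}{12}\,\alpha$$

Therefore, making substitutions in the above value of $2S_3$,

$$\begin{aligned}
2S_3 &= -\frac{N(N + 1)^2}{4}\,\alpha + \frac{N(N + 1)(N - 1)}{12}\,\alpha + \frac{N(N + 1)(N + 2)}{2}\,\beta + \Sigma p_2 y \\
&= \frac{-3N(N + 1)^2 + N(N + 1)(N - 1)}{12}\,\alpha + \frac{N(N + 1)(N + 2)}{2}\,\beta + \Sigma p_2 y \\
&= -\frac{N(N + 1)(N + 2)}{6}\,\alpha + \frac{N(N + 1)(N + 2)}{2}\,\beta + \Sigma p_2 y
\end{aligned}$$

Now, if $\gamma$ is defined as

$$\gamma = \frac{6S_3}{N(N + 1)(N + 2)} \tag{v}$$

then

$$\frac{2N(N + 1)(N + 2)}{6}\,\gamma = -\frac{N(N + 1)(N + 2)}{6}\,\alpha + \frac{3N(N + 1)(N + 2)}{6}\,\beta + \Sigma p_2 y$$

And if $\gamma'$ is defined as $\gamma' = \dfrac{6}{N(N + 1)(N + 2)}\,\Sigma p_2 y$,

$$2\gamma = -\alpha + 3\beta + \gamma'$$

$$\gamma' = \alpha - 3\beta + 2\gamma \tag{vi}$$

and since $C = \dfrac{\Sigma p_2 y}{N(N^2 - 1)(N^2 - 4)/180}$,

$$C = \frac{30}{(N - 1)(N - 2)}\,\gamma' \tag{vii}$$

In the same manner, it can be shown that if

$$\delta = \frac{24S_4}{N(N + 1)(N + 2)(N + 3)} \tag{viii}$$

and

$$\delta' = \frac{20}{N(N + 1)(N + 2)(N + 3)}\,\Sigma p_3 y$$

it follows that

$$\delta' = \alpha - 6\beta + 10\gamma - 5\delta \tag{ix}$$

$$D = \frac{140}{(N - 1)(N - 2)(N - 3)}\,\delta' \tag{x}$$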
As a result of the above analysis, from Eqs. (i), (ii), (v), and
(viii), the following formulas are obtained:

$$\begin{aligned}
\alpha &= \frac{S_1}{N} \\
\beta &= \frac{2}{N(N + 1)}\,S_2 \\
\gamma &= \frac{6}{N(N + 1)(N + 2)}\,S_3 \\
\delta &= \frac{24}{N(N + 1)(N + 2)(N + 3)}\,S_4 \\
\epsilon &= \frac{120}{N(N + 1)(N + 2)(N + 3)(N + 4)}\,S_5 \\
\lambda &= \frac{720}{N(N + 1)(N + 2)(N + 3)(N + 4)(N + 5)}\,S_6
\end{aligned} \tag{8}$$

The values of $\epsilon$ and of $\lambda$ are indicated by extension, since the
symmetrical pattern of these formulas is readily apparent.
The numerators run 2!, 3!, 4!, 5!, 6!, 7!, etc., and the
denominators run $N$, $N(N + 1)$, $N(N + 1)(N + 2)$,
$N(N + 1)(N + 2)(N + 3)$, etc.

Similarly, from Eqs. (i), (iii), (vi), and (ix), the following
formulas are obtained:¹

$$\begin{aligned}
\alpha' &= \alpha \\
\beta' &= \alpha - \beta \\
\gamma' &= \alpha - 3\beta + 2\gamma \\
\delta' &= \alpha - 6\beta + 10\gamma - 5\delta \\
\epsilon' &= \alpha - 10\beta + 30\gamma - 35\delta + 14\epsilon \\
\lambda' &= \alpha - 15\beta + 70\gamma - 140\delta + 126\epsilon - 42\lambda
\end{aligned} \tag{9}$$

and from Eqs. (i), (iv), (vii), and (x), the following formulas are
obtained:

$$\begin{aligned}
A &= \alpha' \\
B &= \frac{6}{N - 1}\,\beta' \\
C &= \frac{30}{(N - 1)(N - 2)}\,\gamma' \\
D &= \frac{140}{(N - 1)(N - 2)(N - 3)}\,\delta' \\
E &= \frac{630}{(N - 1)(N - 2)(N - 3)(N - 4)}\,\epsilon' \\
F &= \frac{2{,}772}{(N - 1)(N - 2)(N - 3)(N - 4)(N - 5)}\,\lambda'
\end{aligned} \tag{10}$$

¹ For additional equations, see Fisher, op. cit., or George W. Snedecor,
Statistical Methods (1940), pp. 324-334, where the procedure is applied to
problems of curvilinear correlation in which probability interpretation is
valid.

Tables to Be Used in Orthogonal-polynomial Analysis to Save
Calculations. All the explanation necessary for the application
of the method of orthogonal polynomials to a problem has been
given. Thus, from a work sheet providing the series of sums
$S_1$, $S_2$, $S_3$, ..., Eqs. (8) could be used to find the series $\alpha$, $\beta$,
$\gamma$, $\delta$, ...; from these, Eqs. (9) could be used to find the series
$\alpha'$, $\beta'$, $\gamma'$, $\delta'$, ...; from these, Eqs. (10) could be used to find
the series $A$, $B$, $C$, .... The set of orthogonal polynomials
fitting the data according to the least-squares criterion could
then be written $y' = A + Bp_1 + Cp_2 + Dp_3 + \cdots$. From
Eqs. (5), values of $p_1$, $p_2$, $p_3$ in terms of $t$ could then be
substituted, and the final equation of trend in terms of $t$ would be
found. But it is desirable to effect another economy, by use of
three tables of values that are the same for all problems having
the same number of years of data.

Thus, the use of Eqs. (8) will be greatly facilitated by the use
of Table 93, a set of constants, $\frac{N(N + 1)}{2}$, $\frac{N(N + 1)(N + 2)}{6}$,
etc., worked out for various odd values of $N$, that is to say, for
various numbers of years, from 11 to 41. The use of Eqs. (10)
will be greatly facilitated by referring to Table 94 for the various
values of the series of constants $\frac{N - 1}{6}$, $\frac{(N - 1)(N - 2)}{30}$,
$\frac{(N - 1)(N - 2)(N - 3)}{140}$, etc. And the use of Eqs. (5) will
be made easier by referring to Table 95 for the values of $\frac{N^2 - 1}{12}$,
$\frac{3N^2 - 7}{20}$, $\frac{3N^2 - 13}{14}$, etc.

Table 93. Values of Specified Variables Dependent upon the Number of Years Included in Trend Calculation
Odd numbers of years, from 11 to 41

Table 95. Values of Specified Variables Dependent upon the Number of Years Included in Trend Calculation
Odd numbers of years, from 11 to 41
Advantages of Method of Orthogonal Polynomials. The method
of orthogonal polynomials is a great timesaver whenever a trend
of higher order than a second-degree polynomial is fitted. While
it has required several pages to describe the method, it will be
noted that the actual solution of a problem requires little more
than a page of figures besides the work sheet. This is
illustrated in Chap. XXIV.

But the saving of time is not the sole advantage of the method
of orthogonal polynomials. In addition, the set of orthogonal
polynomials that is obtained when values for $A$, $B$, $C$, $D$, ...,
are obtained, that is to say,

$$y' = A + Bp_1 + Cp_2 + Dp_3$$

constitutes the solution for any one of several trend lines.
Thus $y' = A + Bp_1$ is the straight-line trend; the addition of
$Cp_2$ gives the second-degree polynomial trend; the addition of
$Dp_3$ gives the third-degree polynomial trend, etc. It is not
necessary to recalculate values for $A$, $B$, $C$, ..., for the various
trends required. If a problem has been worked out to include
solutions for $A$, $B$, $C$, and $D$ and subsequently it is decided that
$E$ is required, it can be found by adding one more column to
the work sheet and finding the value of $E$ without recalculating
the values of $A$, $B$, $C$, and $D$.

This convenience of obtaining several types of trends from
one orthogonal set comes from the fact that the terms of the
orthogonal equation are linearly uncorrelated with each other.¹

¹ See p. 600.



CHAPTER XXIII 

TIME-SERIES ANALYSIS— SEASONAL VARIATION 


Historical Background. The second major stimulus to the
development of methods for analyzing time series, listed at the
beginning of Chap. XX, was the troublesome effects of seasonal
variations in economic activity. Writers on labor problems
stress the evil effects for labor of wide seasonal fluctuations in
some employments. The effects of seasonal variations upon the
banking and credit system were emphasized during the nineteenth
century and the early part of the twentieth century. Even as
early as 1793, Alexander Hamilton advised that redemption of
the public debt be carried on during the winter, for, said he,
"it is a familiar fact that during the winter in this country, there
is always a scarcity of money in the towns, a circumstance
calculated to damp the price of stock."¹

Jevons made an analysis of the effects of the "autumnal
pressures" on the London money market and calculated the average
monthly fluctuations in currency movement between the Bank
of England and its branches (1855-1862) and the average
monthly excess of payments or receipts of British coin at the
Bank of England for the same period.² In 1890, George Clare
analyzed the seasonal variations for the period from 1881 to
1890 in the circulation of the Bank of England, in public deposits,
in other deposits, in "other securities," in the "reserve," and
in the "internal gold movements."³ In 1902, J. P. Norton
published a study of the New York money market in which he

¹ 28th Congress, 1st Session, Executive Document 15, p. 199. Cf. Myers,
Margaret G., The New York Money Market, Vol. 1, Origins and
Development, p. 208. Other early references to seasonal fluctuations are Hunt's
Merchants' Magazine, Vol. 20, p. 302, Vol. 39, p. 582; Journal of Commerce,
Aug. 3, 1846.

² Investigations in Currency and Finance (Foxwell ed., 1909), pp. 158-159.
Cf. Mitchell, W. C., Business Cycles: The Problem Stated and Its Setting
(1928), pp. 199, 236.

³ A Money Market Primer (2d ed.), pp. 19, 24, 31, 42, 53, 55. Cf. Smith,
James G., Benjamin H. Beckhart, and William A. Brown, The New York
Money Market, Vol. 4, External and Internal Relations, p. 424.

computed the seasonal variation in loans, the peaks occurring on
Mar. 4 and in July and December and the low points occurring 
at the beginning of the year, in May, and at the end of Novem- 
ber.^ The outstanding statistical analysis of seasonal variations 
in the New York money market before the First World War is 
that prepared for the National Monetary Commission in 1910 
by Prof. E. W. Kemmerer.^ In this study he analyzed seasonal 
variations in money rates, exchange rates, bond yields, currency 
movements, and deposits. His analysis brought out the sea- 
sonal relationships in a striking manner, in spite of very strict 
limitations in available data at the time. Much of his work is 
based upon data gathered by the questionnaire method. 

Causes of Seasonal Variation. Two types of underlying forces
cause seasonal variations in economic activity: (1) climatic
conditions giving rise to seasons in agricultural production, in
out-of-door construction work, in the manufacture of clothing, in
the use of fuel, and in traveling, etc., and (2) forces arising from
convention, such as the Christmas and Easter trade and seasonal
style convention.³ The effects of these various basic
seasonal influences upon the New York money market and upon
the banking and credit structure of the United States have
recently been exhaustively studied and published in Vol. 4 of the
previously mentioned studies of The New York Money Market,
edited by Prof. Benjamin H. Beckhart of Columbia University.⁴

In large part the movement for banking reform in this country,
which culminated in the studies of the National Monetary
Commission and the Federal Reserve Act of 1913, was the result of
the evil effects of seasonal fluctuations in the demands of trade
giving rise to periodical stringencies in the money market and
frequently initiating monetary panics. Consequently, it was one of
the most important aims of the Federal Reserve System to devise
an elastic currency and credit system that would accommodate
these seasonal demands.⁵ Thus banking reform in the United
States is a case in which a long-recognized evil was finally
statistically measured and evaluated and a reform in the system
definitely resulted in improvement.

¹ Statistical Studies in the New York Money Market, pp. 62-64.

² Seasonal Variations in the Relative Demand for Money and Capital in
the United States (National Monetary Commission Publications), Vol. 22.

³ Cf. Mitchell, op. cit., pp. 236-240.

⁴ The New York Money Market, Vol. 4, External and Internal Relations,
pp. 417-642.

⁵ The New York Money Market, Vol. 2, Sources and Movements of Funds,
pp. 165-374.

Not only in the field of banking has the study of seasonal 
variation by statistical methods been stimulated. In addition, 
unemployment with all its economic, social, and psychological 
implications has aroused great concern about the measurement of 
such variation. Extended reference to the problem of seasonal
unemployment was made at former President Hoover's Conference
on Unemployment, in the Report and Recommendations
of the Committee to Investigate Business Cycles and
Unemployment.¹

In the hearings before the Committee on Education and
Labor, of the United States Senate, in 1928-1929, much material
and discussion are devoted to the subject of the seasonal variations
in employment in industries and trade.² Franklin D.
Roosevelt, when governor of New York State, appointed a Committee
on the Stabilization of Industry for the Prevention of
Unemployment, which made its report to him in November,
1930, entitled Less Unemployment through Stabilization of
Operations, in which the subject of seasonal variations in
employment constituted an important part.

During the years leading up to the depression of the 1930's, 
much was written on seasonal variation in employment and its 
contemplated stabilization. Thereafter, the problem of cyclical 
unemployment and its solution by means of unemployment 
insurance and the entire social security program dominated the
scene.³

¹ New York, 1923, pp. 6, 161, 215.

² 70th Congress, 2d Session, "Unemployment in the United States,"
S.R. 219.

³ Smith, Edwin S., Reducing Seasonal Unemployment: The Experience of
American Manufacturing Concerns (1931). Douglas, Paul H., and Aaron
Director, The Problem of Unemployment. This book devotes pp. 73-118
to the subject of seasonal variations and regularization of industry to
stabilize such fluctuations. Hansen, Alvin H., and Tillman M. Sogge,
Seasonal Irregularity of Employment in Minneapolis, St. Paul and Duluth
(Employment Stabilization Research Institute, November, 1931).
Berridge, W. A., "Employment and Income of Labor in the United States,"
in International Unemployment (a study of fluctuations in employment and
unemployment in several countries, 1910-1930, Industrial Relations
Institute, The Hague, Netherlands, 1932).
A familiar example of seasonal activity in the economic sphere
is construction activity, which gives not more than two-thirds
as much employment in the winter months, on the average, as
in the summer. Some important manufacturing industries, too,
such as the automobile, agricultural implements, and ready-made
clothing industries, show a considerable seasonal fluctuation.
To be sure, the busy season in some industries comes in the dull
season for others, a fact that tends to level out the differences
between the number employed in industry in its entirety in
one month as compared with another. But this does not mean
that the workers released by one industry are absorbed by
another to a sufficient degree or with sufficient promptitude to
obliterate the variations from month to month in the amount
of their employment. Barriers of specialized skill, geography,
and attachment to particular occupations and localities prevent
anything like the dovetailing suggested by the figures of the
total number employed.¹ Consequently, the statistics of total
employment may show little seasonal variation, while at the
same time large degrees of seasonal unemployment exist in many
parts of the total. The fact that there is no seasonal variation
or little seasonal variation in total employment does not solve the
unemployment problem for the seasonally unemployed worker.

One reason why concern, statistically speaking, about the
subject of seasonal variations in employment has been stimulated
is that the opinion prevails that this particular type of
unemployment is in large part avoidable. The movement to
inaugurate unemployment insurance in the United States was
partly based upon the belief that such a measure for the relief
of unemployment would tend to regularize industries affected
by seasonal unemployment. It is recognized that the greater
problem of cyclical unemployment is less easily solved. The
literature on the subject of unemployment insurance in the
United States makes it clear that the movement is directed
particularly toward the regularization of industry to eliminate as
much as possible of the seasonal fluctuation in employment.²

With these problems in mind, students of the labor problem
asked: What types of business are responsible for the largest
part of this seasonal irregularity in employment? What are the
peak and slack seasons of employment in different businesses, and
what are the amplitudes of fluctuations? What can be done to
make business less seasonal and less irregular? What is the cost
of regularization plans in an industry, and how do such costs
compare with the savings resulting from more regular use of capital
investment? These are the types of exceedingly practical problems
presenting themselves in this field of economics, and they
have stimulated statistical research to take measurements of
seasonal variations. They are of practical significance to
employers and to investors and to workers. They are of great
social and psychological significance to the social scientist, the
economist, and the political theorist.

¹ McCabe, David A., chapter on Unemployment, in Facing the Facts (a
symposium, 1932), pp. 324-325, 333-351.

² Cf. McCabe, op. cit., pp. 344-346, 350.

Methods of Measuring Seasonal Variations 


It has been seen that the method of discovering trends either 
for their own sake (rational trends) or in order to remove them 
from the data, i.e., to get rid of them (empirical trends), has been 
based upon curve-fitting technique. The technical problem 
involved is a simple one even though the mathematics may be 
complex in some cases. The simplicity of the idea is somewhat 
offset, however, by the irrational character of the procedure. 
This is a troublesome factor because it is the function of the 
statistician not only to apply mathematical analysis to statistics 
but also to explain what he does and why he does it. Enough
has been included in Chaps. XX and XXI to indicate the
general character of this problem.

In the case of seasonal variation, the difficulties of the
statistician are just the reverse; for while it has been possible to
build up a perfectly rational procedure, upon the basis of the theory
of averages, the technical problem involved has been found to be
a complex one. The rational concept underlying the procedure of
measuring seasonal variation is that, where a time series has a
characteristic seasonal variation occurring year after year, it
should be quite reasonable to depict a "typical," or average,
seasonal variation for that time series.

In its abstract aspect, therefore, the concept is perfectly
rational. Homogeneous variates are to be averaged to obtain
a type. For example, it is proposed to average the amount
by which January data are higher or lower than those of other
months of the year, to average the amount by which February
data are higher or lower than those of other months of the
year, and so on, until a picture of the "type" of periodical
movement that occurs every year is obtained, although each year may
be slightly different from the type. Moderate variations from
the type are quite consonant with the theory of averages and
their application to the problem of measuring seasonal variation.¹

When the rational procedure is to be put into effect, however, 
difficulties of a technical character arise. A time series of raw 
data that by a priori knowledge should have a distinctively 
regular seasonal variation may be selected. A graph of the time 
series is made, and a seasonal variation occurring every year is 
revealed, but the seasonal periodicity in the raw data is distorted 
by other movements, namely, trend and cycle. This was noted 
at the beginning of Chap. XX where a hypothetical time series 
was constructed. It is clear that the data in their raw state 
cannot be averaged to find the typical seasonal periodicity. 
That is to say, January, 1937, is not homogeneous with respect 
to seasonal variation with January, 1940, because the relative 
position of the respective Januaries (1) as to trend and (2) as to 
cycles is not comparable. In other words, averaging the raw
data of all the Januaries in a series of data, all the Februaries,
etc., for the 12 months of the year would be an irrational
procedure. This would not accord with the rational idea of seasonal
variation outlined above because the averages of raw data would 
include averages of something in addition to seasonal variation. 
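
A small numerical sketch may make the point concrete. The figures are hypothetical (a series with a steady upward trend and no seasonal movement whatever), and the variable names are illustrative only. Averaging the raw Januaries and the raw Decembers still yields a difference between the two months, and that difference comes entirely from trend:

```python
# Hypothetical monthly series: a straight-line trend, no seasonal movement.
YEARS = 4
raw = [100 + 2 * t for t in range(12 * YEARS)]  # t = 0 is January of year 1

# Average the raw data month by month, as the irrational procedure would.
monthly_average = [sum(raw[m::12]) / YEARS for m in range(12)]

january, december = monthly_average[0], monthly_average[11]
spurious_gap = december - january  # produced by trend alone, not by season

print(january, december, spurious_gap)
```

Here the apparent excess of December over January is 11 months of trend at 2 units a month; a cyclical swing in the raw data would distort the monthly averages in the same manner.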

Problem of Isolating Seasonal Variation. To average the
actual seasonal variation, it must be isolated from the other
types of variation in the raw data. The technical problem
involved in the measurement of seasonal variation is thus how to
isolate from the raw data that part of its fluctuation that is
essentially seasonal in character. When these other types of
fluctuation have been removed from the raw data, by subtraction
or division, all the residual Januaries can be rationally averaged,
all the residual Februaries can be averaged, etc., in order to
obtain a picture of the average position, respectively, of each
month.

¹ If the seasonal variation were measured weekly, rather than monthly,
the principle would be the same. The same principle may be used to
measure periodicity by days within the month or within the week; and it
may likewise be used to measure periodicity by hours within the day.
Thus periodicity by days within the month of wage payments might have
great economic value for some problems; and periodicity by hours within
the day of consumption of electrical power might have significance in
connection with some problems. Seasonal variation is only one type of
periodicity that can be measured by this method.

How can this be done? There are several answers to this 
question, and there is controversy as to just what is the best 
technical procedure. In his notable studies of seasonal variation 
made about 1910, Prof. Kemmerer devised a method for measur- 
ing seasonal variation separate from other fluctuations. At the 
time, it was the best that had been suggested.¹

Another famous suggestion as to a method of isolating the
seasonal periodicities from other types of fluctuations was made
by W. M. Persons, when from 1915 to 1919 he developed his
approach to the problems of time-series analysis, culminating
in the establishment of the Harvard Economic Society's business
barometer and the Review of Economic Statistics. Persons'
method, called the "link relative method," expresses each
monthly figure as a relative of the immediately preceding month;
the seasonal pattern is found by averaging all the link relatives
for the same month and taking any residual trend out of the chain
relatives computed from these average link relatives.²
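
The link relative idea can be sketched in a few lines. This is an illustration under stated assumptions and not Persons' own work sheet: the link relatives of each month are averaged by their median, the residual trend in the chain is spread geometrically over the 12 months, and the function name `link_relative_index` and the test data are hypothetical.

```python
from statistics import median

def link_relative_index(raw):
    """Seasonal index by link relatives; raw[0] is taken to be a January."""
    # Link relatives: each month as a percentage of the preceding month.
    links = [[] for _ in range(12)]
    for i in range(1, len(raw)):
        links[i % 12].append(100.0 * raw[i] / raw[i - 1])
    typical = [median(l) for l in links]  # typical[0] is January on December

    # Chain relatives with January = 100.
    chain = [100.0]
    for m in range(1, 12):
        chain.append(chain[-1] * typical[m] / 100.0)

    # Closing the chain gives a second January; its excess over 100 is
    # residual trend, removed here at a constant monthly rate (an assumption).
    next_january = chain[-1] * typical[0] / 100.0
    monthly_drift = (next_january / 100.0) ** (1.0 / 12.0)
    adjusted = [c / monthly_drift ** m for m, c in enumerate(chain)]

    # Express the adjusted chain relatives as percentages of their average.
    average = sum(adjusted) / 12
    return [100.0 * a / average for a in adjusted]

# Hypothetical data: an assumed seasonal pattern on a 1 per cent monthly growth.
pattern = [0.96, 0.94, 0.93, 0.95, 0.98, 1.03, 1.05, 1.08, 1.05, 1.02, 1.00, 1.01]
raw = [100 * pattern[i % 12] * 1.01 ** i for i in range(48)]
index = link_relative_index(raw)
print([round(v, 1) for v in index])
```

With data built in this way the index returns the assumed pattern, since the growth enters every link relative equally and is removed when the chain is closed. At least two years of data are needed so that every month, January included, has a link relative.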

A third method of isolating the seasonal fluctuations and
measuring them by an index of seasonal variation is that advocating
simply the removal of trend from the data and then the
averaging of the monthly ratio differences from the trend.³
While this method removes the nonhomogeneous effects of trend,
it does not remove those due to cyclical fluctuations. If taken
over a sufficient period of time, the bias of the cyclical fluctuations
will cancel so that a true index of seasonal variations would be
obtained by averaging. One of the discoveries of recent years,
however, is that seasonal variations change in the same time
series from one era to another owing to new conditions; for such
time series an index of seasonal variation based on a period of
time covering two or more eras would be comparatively useless.
Thus the criticism of the ratio-difference-from-the-trend method
is that, if taken over a sufficient period of time to make it a
valid measurement of seasonal variation, it would be taken for
too long a time, i.e., that two or more eras of typical seasonal
fluctuation might be confused.

¹ Op. cit. Cf. criticism of Kemmerer's method by W. L. Hart, Journal of
the American Statistical Association, Vol. 17 (1922). Kemmerer's work
constitutes an important pioneer effort to solve the technical difficulties
involved and helped direct attention to better solutions.

² Review of Economic Statistics, January, 1919, pp. 18-31; Indices of
Business Conditions (1919). Cf. Rietz, H. L., Handbook of Mathematical
Statistics, pp. 151-155.

³ Falkner, Helen D., "The Measurement of Seasonal Variation,"
Journal of the American Statistical Association, Vol. 19 (1924), pp. 167-179;
Robb, Richard A., "Variate Difference Method of Seasonal Variation,"
ibid., Vol. 24 (1929), pp. 256-257.

A number of other methods have been suggested, based
upon the principles that have been outlined.¹ The most widely
used and probably the best method is the 12 months' moving
average method, of which a number of refinements have been
suggested. Since this method is the one most extensively used,
it is now described in detail and an illustration will be given.

¹ King, W. I., "An Improved Method for Measuring the Seasonal Factor,"
Journal of the American Statistical Association, Vol. 19 (1924), pp.
301-313; Carmichael, F. L., "Methods of Computing Seasonal Indexes:
Constant and Progressive," ibid., Vol. 22 (1927), pp. 339-354; Joy, Aryness,
and Woodlief Thomas, "Use of Moving Averages in the Measurement of
Seasonal Variation," ibid., Vol. 23 (1928), pp. 241-252; Baumann, A. O.,
"Thirteen Months-Ratio-First Difference Method of Measuring Seasonal
Variation," ibid., Vol. 23 (1928), pp. 282-290; Kuznets, Simon, "Seasonal
Patterns and Seasonal Amplitudes; Measurement of Their Short-time
Variations," ibid., Vol. 27 (1932), pp. 9-20; Riggleman, John R., and Ira
N. Frisbee, Business Statistics (1932), pp. 226-242.

Twelve Months' Moving Average Method. This method
consists of the following steps:

1. Calculate a 12 months' moving average of the raw data,
centering the moving average at the seventh month; thus,
opposite July of the first year would be the average of the
12 months of that year; opposite August would be the average
of the last 11 months of that year and the first month of the next
year; and so on.

2. Divide the raw data serially by the 12 months' moving
average. Inasmuch as the moving average would contain
in it the elements both of trend and of major and minor cycles,
the residuals of the raw data from the moving average (either
by subtraction or division) would contain purely seasonal
fluctuations.

3. Make frequency distributions of the several months (see
the example at the end of the chapter).

4. Find the median relatives for each month, using the median
to avoid the influence of extreme fluctuations.

5. Express these median relatives as a percentage of their own
average, thus giving an index of seasonal variation.

As a short cut, inasmuch as the result will be precisely the
same, the 12 months' moving total may be used instead of the
12 months' moving average, thus saving the division throughout
by 12.
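
The five steps, with the short cut of the moving total, can be sketched as follows. This is a minimal illustration and not the authors' work sheet: the function name `seasonal_index` is hypothetical, the series is assumed to begin with a January, and the test data are invented.

```python
from statistics import median

def seasonal_index(raw):
    """Index of seasonal variation for monthly data whose first item is a January."""
    ratios_by_month = [[] for _ in range(12)]
    # Steps 1 and 2: divide each datum by the 12 months' moving total
    # centered at the seventh month (the six months before, the month
    # itself, and the five months after).
    for i in range(6, len(raw) - 5):
        moving_total = sum(raw[i - 6:i + 6])
        ratios_by_month[i % 12].append(100.0 * raw[i] / moving_total)
    # Steps 3 and 4: one array of ratios per month, summarized by its median.
    medians = [median(r) for r in ratios_by_month]
    # Step 5: express the medians as percentages of their own average.
    average = sum(medians) / 12
    return [100.0 * m / average for m in medians]

# Hypothetical data: a constant level of 100 times an assumed seasonal pattern.
pattern = [0.96, 0.94, 0.93, 0.95, 0.98, 1.03, 1.05, 1.08, 1.05, 1.02, 1.00, 1.01]
raw = [100 * pattern[i % 12] for i in range(48)]
index = seasonal_index(raw)
print([round(v, 1) for v in index])
```

With a constant level every moving total is the same, so the index simply returns the assumed pattern. The first six and the last five months of any series have no centered total and contribute no ratios, so at least two full years are needed before every month is represented.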

Problem Illustrating Measurement of Seasonal Variation. 

Calculating the Index of Seasonal Variation by the 12 Months'
Moving Average Method. The time series of monthly data on
consumer installment-sale debt for household appliances in the
United States has been selected to illustrate the calculation of an
index of seasonal variation by the 12 months' moving average
method. Table 96 is a work sheet for the calculations necessary
to the problem. The data were recorded on this work sheet for
the years 1929-1942 by months, the raw data appearing in
column (1). Next, a 12 months' moving total was calculated;
this appears in column (2), the moving total being "centered" at
the seventh month. For example, the figure 2,930 after July,
1929, in column (2) of the work sheet is the total of the 12
monthly figures for 1929; the figure 2,972 (opposite August,
1929) is the total of the next 12 monthly figures, beginning with
February, 1929, and ending with January, 1930. Opposite each
July is the total for that year; this constitutes a good cross
check in the construction of the moving total.

To calculate the moving total, first put the 12 monthly figures
for 1929 in the adding machine, and take a subtotal; then
subtract the datum for January, 1929, and add the datum for
January, 1930, and take a subtotal; then subtract the datum for
February, 1929, and add the datum for February, 1930, and
take a subtotal; and so on, until the end of the time series.
Clear the machine, and then add independently the last 12
months of the time series; this should check with your last
subtotal. If it does not check, a mistake has been made, which
can be most readily found by checking up on the July subtotals
for each year, beginning with the last one and going back until
you find the mistake. These subtotals are the 12 months'
Table 96. Work Sheet for Calculating Index of Seasonal Variation
Data: Consumer installment-sale debt, monthly, for household appliances,
end of month
(In millions of dollars)

                    (1)          (2)             (3)
                  Monthly    12 months'    Raw data divided
Year and month    raw data   moving total  by 12 months'
                             centered at   moving total,
                             7th month     per cent

1929:
  January          207
  February         199
  March            199
  April            217
  May              237
  June             260
  July             273        2,930          9.32
  August           274        2,972          9.22
  September        272        3,006          9.05
  October          266        3,031          8.78
  November         261        3,043          8.58
  December         265        3,043          8.71
1930:
  January          249        3,031          8.22
  February         233        3,010          7.74
  March            224        2,984          7.51
  April            229        2,953          7.75
  May              237        2,919          8.12
  June             248        2,881          8.61
  July             252        2,838          8.88
  August           248        2,800          8.86
  September        241        2,766          8.71
  October          232        2,734          8.48
  November         223        2,701          8.26
  December         222        2,666          8.33
1931:
  January          211        2,628          8.03
  February         199        2,588          7.69
  March            192        2,548          7.54
  April            196        2,509          7.81
  May              202        2,471          8.17
  June             210        2,434          8.63
  July             212        2,397          8.84
  August           208        2,357          8.82
  September        202        2,318          8.71
  October          194        2,275          8.53
  November         186        2,225          8.36
  December         185        2,167          8.54
1932:
  January          171        2,101          8.14
  February         160        2,028          7.89
  March            149        1,954          7.63
  April            146        1,881          7.76
  May              144        1,813          7.94
  June             144        1,749          8.23
  July             139        1,685          8.25
  August           134        1,628          8.23
  September        129        1,575          8.19
  October          126        1,528          8.25
  November         122        1,484          8.22
  December         121        1,447          8.36
1933:
  January          114        1,418          8.04
  February         107        1,398          7.65
  March            102        1,386          7.36
  April            102        1,378          7.40
  May              107        1,372          7.80
  June             115        1,367          8.41
  July             119        1,365          8.72
  August           122        1,364          8.94
  September        121        1,365          8.86
  October          120        1,370          8.76
  November         117        1,384          8.45
  December         119        1,403          8.48
1934:
  January          113        1,422          7.95
  February         108        1,441          7.49
  March            107        1,456          7.35
  April            116        1,468          7.90
  May              126        1,479          8.52
  June             134        1,490          8.99
  July             138        1,502          9.19
  August           137        1,515          9.04
  September        133        1,528          8.70
  October          131        1,544          8.48
  November         128        1,561          8.20
  December         131        1,578          8.30
1935:
  January          126        1,600          7.88
  February         121        1,626          7.44
  March            123        1,657          7.42
  April            133        1,692          7.86
  May              143        1,728          8.28
  June             156        1,768          8.82
  July             164        1,808          9.07
  August           168        1,845          9.10
  September        168        1,882          8.93
  October          167        1,921          8.69
  November         168        1,964          8.55
  December         171        2,017          8.48
1936:
  January          163        2,075          7.86
  February         158        2,143          7.37
  March            162        2,211          7.33
  April            176        2,281          7.72
  May              196        2,353          8.33
  June             214        2,428          8.81
  July             232        2,512          9.24
  August           236        2,596          9.09
  September        238        2,683          8.87
  October          239        2,772          8.62
  November         243        2,863          8.49
  December         255        2,952          8.64
1937:
  January          247        3,045          8.11
  February         245        3,130          7.83
  March            251        3,217          7.80
  April            267        3,302          8.09
  May              285        3,382          8.43
  June             307        3,451          8.90
  July             317        3,503          9.05
  August           323        3,552          9.09
  September        323        3,594          8.99
  October          319        3,624          8.80
  November         312        3,639          8.57
  December         307        3,636          8.44
1938:
  January          296        3,610          8.20
  February         287        3,571          8.04
  March            281        3,525          7.97
  April            282        3,475          8.12
  May              282        3,420          8.24
  June             281        3,371          8.34
  July             278        3,330          8.35
  August           277        3,290          8.42
  September        273        3,253          8.39
  October          264        3,218          8.20
  November         263        3,183          8.26
  December         266        3,154          8.43
1939:
  January          256        3,153          8.12
  February         250        3,120          8.01
  March            246        3,110          7.91
  April            247        3,103          7.96
  May              253        3,104          8.15
  June             260        3,106          8.37
  July             265        3,113          8.51
  August           267        3,119          8.56
  September        266        3,124          8.51
  October          265        3,131          8.46
  November         265        3,143          8.43
  December         273        3,161          8.64
1940:
  January          262        3,182          8.23
  February         255        3,205          7.96
  March            253        3,232          7.83
  April            259        3,259          7.95
  May              271        3,284          8.25
  June             281        3,309          8.49
  July             288        3,338          8.63
  August           294        3,366          8.73
  September        293        3,397          8.62
  October          290        3,430          8.45
  November         290        3,474          8.35
  December         302        3,523          8.57
1941:
  January          290        3,572          8.12
  February         286        3,620          7.90
  March            286        3,672          7.79
  April            303        3,721          8.14
  May              320        3,764          8.50
  June             330        3,794          8.70
  July             336        3,805          8.83
  August           346        3,809          9.08
  September        342        3,808          8.98
  October          333        3,794          8.78
  November         320        3,749          8.54
  December         313        3,670          8.53
1942:
  January          294        3,559          8.26
  February         285        3,425          8.32
  March            272        3,262          8.34
  April            258        3,089          8.35
  May              241
  June             219
  July             202
  August           183
  September        169
  October
  November
  December

Source: Holthausen, Duncan McC., "Monthly Estimates of Short-term Consumer
Debt, 1929-1942," Survey of Current Business, Vol. 22 (November, 1942), pp. 9-25.


moving total and can be tabulated in column (2) of the work
sheet, as in Table 96. The next step is to divide each monthly
raw datum by the corresponding moving total figure, expressing
the answer as a percentage figure in column (3). The figures in
column (3) are then tabulated in a system of frequency arrays,
as in Fig. 147.
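
The adding-machine routine for the moving total, with its closing check, and the division that gives column (3) can be sketched as follows. The function name `moving_totals` is hypothetical; the figures are the 1929 and 1930 raw data of Table 96, December, 1929, being taken as 265, the value consistent with the printed moving totals.

```python
def moving_totals(raw, span=12):
    """12 months' moving totals, computed as on the adding machine."""
    total = sum(raw[:span])              # first subtotal: the first 12 months
    totals = [total]
    for i in range(span, len(raw)):
        total += raw[i] - raw[i - span]  # add the new month, drop the oldest
        totals.append(total)
    # Independent check: the last 12 months, added afresh, must equal
    # the last subtotal.
    if total != sum(raw[-span:]):
        raise ValueError("moving total does not check")
    return totals

# Column (1) of Table 96 for 1929 and 1930 (millions of dollars).
raw = [207, 199, 199, 217, 237, 260, 273, 274, 272, 266, 261, 265,
       249, 233, 224, 229, 237, 248, 252, 248, 241, 232, 223, 222]

totals = moving_totals(raw)
# totals[k] is centered at the seventh month of its span, i.e., opposite raw[k + 6].
ratios = [round(100 * raw[k + 6] / t, 2) for k, t in enumerate(totals)]

print(totals[0], totals[1], ratios[0], ratios[1])
```

The first two totals are the 2,930 and 2,972 entered opposite July and August, 1929, and the last total, 2,838, is the sum of the 12 months of 1930 entered opposite July, 1930. Updating the running total in place of re-adding 12 figures each time is the same economy the adding machine gives.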

From Fig. 147 the median monthly relatives are read and 
arranged as in Table 97. 
Fig. 147. Frequency arrays, one for each month, of distributions of monthly
ratios of raw data to 12 months' moving total. Column (3) of Table 96.
Consumer installment-sale debt for household appliances in the United States,
1929-1942.

Column (1) of Table 97 consists of the median relatives read
from Fig. 147; and these median relatives have only to be
expressed as percentages of their own average to give the index
of seasonal variation. This is done, giving the figures in column (2).

Table 97. Index of Seasonal Variation in Consumer Installment-sale
Debt for Household Appliances in the United States

Month           Medians     Index of seasonal
                            variation¹

January           8.13            96.2
February          7.95            94.0
March             7.82            92.5
April             8.05            95.2
May               8.25            97.6
June              8.72           103.2
July              8.85           104.7
August            9.10           107.6
September         8.88           105.0
October           8.64           102.2
November          8.50           100.6
December          8.55           101.1

Total           101.44         1,200.0
Average           8.4533+

¹ This column consists of the medians expressed as percentages of their
average. Thus 8.13 is 96.2 per cent of 8.4533+.
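
The arithmetic of Table 97 can be checked in a few lines. The medians are those printed in the table; the variable names are illustrative only.

```python
# Medians of Table 97, January to December.
medians = [8.13, 7.95, 7.82, 8.05, 8.25, 8.72,
           8.85, 9.10, 8.88, 8.64, 8.50, 8.55]

average = sum(medians) / len(medians)  # 8.4533+

# Each median expressed as a percentage of the average of the twelve.
index = [round(100 * m / average, 1) for m in medians]

for month_number, value in enumerate(index, start=1):
    print(month_number, value)
```

Rounded to one decimal, the results agree with the index column of the table. The rounded items add to 1,199.9 and not exactly to 1,200.0, a discrepancy due to rounding alone.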






Fig. 148. Studies in seasonal variations in commercial paper rates, before and
since the establishment of the Federal Reserve System. (Horizontal axis of
each panel: months, January to December.)

tions and chance fluctuations. By averaging, the chance 
fluctuations are canceled out, leaving in the index a description
of relative seasonal movement. The theory of this method is
based, of course, upon the use of a 12 months' moving average;
but precisely the same arithmetical results are obtained by using
the moving total instead, and it is a saving of a considerable number
of division processes. In dividing by the 12 months' moving
total instead of by the 12 months' moving average, the average
residual percentage is 8.333+, whereas, in dividing by the
12 months' moving average, the average residual percentage
would be 12 times 8.333+, or 100.00.
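
That the moving total and the moving average lead to the same index can be verified directly. The sketch below uses the July, 1929, entry of Table 96 and the medians of Table 97; the helper name `as_index` is hypothetical.

```python
# July, 1929, from Table 96: raw datum and 12 months' moving total.
raw_datum, moving_total = 273, 2930

ratio_to_total = 100 * raw_datum / moving_total
ratio_to_average = 100 * raw_datum / (moving_total / 12)  # exactly 12 times as large

def as_index(values):
    """Express each value as a percentage of the average of all the values."""
    average = sum(values) / len(values)
    return [100 * v / average for v in values]

# Medians of ratios to the moving total (Table 97) and the same medians
# as they would stand had the moving average been the divisor.
medians_total = [8.13, 7.95, 7.82, 8.05, 8.25, 8.72,
                 8.85, 9.10, 8.88, 8.64, 8.50, 8.55]
medians_average = [12 * m for m in medians_total]

index_total = as_index(medians_total)
index_average = as_index(medians_average)

print(round(ratio_to_total, 2), round(ratio_to_average, 2))
```

The factor of 12 multiplies every median and their average alike, so it cancels when the medians are expressed as percentages of their own average.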

From the multiple frequency array, as in Figs. 147 and 148, it
can be determined whether or not the seasonal variation is well
defined. If the course of all the recorded ratios of raw data to
the 12 months' moving total by months tends to run close to the
course of the medians, then the seasonal variation is a
well-defined one. If, however, the points are scattered in a wide
range from the medians and the general swing of the data does
not correspond to the movements of the median line, then the
seasonal variation is not well defined. Such a result might be
obtained if the type of the seasonal variation were changing, and
in that case the data may be studied in groups of a smaller number
of years. Figure 148 is included to present examples of
poorly defined seasonal variation, as compared with well-defined
cases of seasonal variation. The data studied are commercial
paper rates in the New York money market before and after
the inception of the Federal Reserve System. From the figure
it is seen how well defined the seasonal variation in commercial
paper rates was before the beginning of the Federal Reserve
System, namely, for the periods 1904-1909 and 1909-1914.
Also, it is seen how poorly defined is the seasonal variation for
the periods 1920-1925 and 1925-1930, so poorly that there
could hardly be said to have been any consistent seasonal
periodicity whatever.¹

METHOD OF DETECTING CHANGING SEASONAL VARIATION 

Figure 149 is drawn to discover whether or not, during the
years from 1929 to 1941, the seasonal variation in consumer
installment debt for household appliances has changed.² The
figure is a plot of the ratios to 12 months' moving total shown
in the last column of Table 96; a separate graph for each month
has been drawn in Fig. 149.

¹ For a more complete discussion see The New York Money Market, Vol.
4, pp. 510-530.

² For other suggested methods of measuring changing seasonal variation
see Julius Shiskin, "New Multiplicative Seasonal Index," Journal of the
American Statistical Association, Vol. 37 (1942), pp. 507-516; Henry A.
Latané, "Seasonal Factors Determined by Difference from Average of
Adjacent Months," Journal of the American Statistical Association, Vol.
37 (1942), pp. 517-522; Dudley J. Cowden, "Moving Seasonal Indexes,"
Journal of the American Statistical Association, Vol. 37 (1942), pp. 523-524.

The straight-line trends for each month were drawn in “at 
sight”; they were not fitted by a mathematical method. Figure 
149 shows that the relative seasonal position of January, June, 
October, November, and December remained about the same 
during this period of years. But the relative seasonal amount of 
consumer installment debt was rising in the months of February, 
March, April, and May, while the relative seasonal quantity of 
consumer installment debt was declining in the months of 
July, August, and September. 

Consequently, a more refined index of seasonal variation 
than the average of a period of years such as that shown in 
Table 97 and Fig. 148, can be obtained from Fig. 149. In fact, 
since these trends exist, a different index of seasonal variation 
for each year is required. For 1942 this index of seasonal varia- 
tion can be obtained as indicated in Table 98. 

Table 98. — Computation of Index of Seasonal Variation in Consumer 
Installment-sale Debt for Household Appliances, 1942 

Month          Ratios, read from          Index of seasonal 
               trend lines in Fig. 149    variation* 

January              8.10                       96.4 
February             8.09                       96.2 
March                8.08                       96.1 
April                8.15                       97.0 
May                  8.35                       99.3 
June                 8.60                      102.3 
July                 8.65                      102.9 
August               8.75                      104.1 
September            8.65                      102.9 
October              8.54                      101.6 
November             8.40                       99.9 
December             8.50                      101.1 

Total              100.86                    1,200.0 
Average              8.405 

* Obtained by expressing ratios in the first column as percentages of their own average. 
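The conversion described in the footnote to Table 98 can be sketched in a few lines of Python. The ratios are those read from the trend lines in Fig. 149; the variable names are illustrative only, and the rounding to one decimal place is an assumption about how the printed index was prepared.

```python
# Ratios to the 12 months' moving total, read from the 1942 trend lines
# (first column of Table 98).
ratios = {
    "January": 8.10, "February": 8.09, "March": 8.08, "April": 8.15,
    "May": 8.35, "June": 8.60, "July": 8.65, "August": 8.75,
    "September": 8.65, "October": 8.54, "November": 8.40, "December": 8.50,
}

# Average of the twelve ratios (the "Average" row of the table).
average = sum(ratios.values()) / len(ratios)

# Each ratio expressed as a percentage of that average.
unrounded_index = {month: 100 * r / average for month, r in ratios.items()}
seasonal_index = {month: round(v, 1) for month, v in unrounded_index.items()}

for month, value in seasonal_index.items():
    print(f"{month:<10} {ratios[month]:5.2f} {value:6.1f}")
```

Before rounding, the twelve index numbers necessarily total 1,200, since each is a percentage of the common average; individual rounded entries may differ from the printed column in the last digit.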




CHAPTER XXIV 
DETERMINATION OF CYCLE 


Usually it is desirable to have current figures on a monthly 
basis, and to know how actual experience compares with what 
should be expected for the season and with normal growth. 
Can we estimate our position in the business cycle from month 
to month? Annual data adjusted for trend and a picture of 
undistorted seasonal variation, illustrated in the preceding 
chapters, do not go quite far enough. It is often necessary 
to remove trend and seasonal variation from monthly data in 
order to determine position in the cycle. 

Cycle Determined by Adjusting Monthly Data. When monthly, 
instead of annual, data are analyzed, the empirical trend may be 
found by setting up a work sheet similar to Table 83 or 86 (pages 
582, 688), depending upon the type of trend selected. The 
trend is then fitted by the method of least squares in a manner 
precisely similar to that demonstrated for annual data. Of 
course, if quite a number of years of monthly data are thus 
treated, the calculations become very extended, but the principle 
remains the same.¹ It is possible, however, to derive an approxi- 
mation of the monthly trend equation from an annual trend 
equation. This is explained in the present chapter and may 
serve as an economizer of time in the analysis of monthly data. 

Determination of Cycle in Annual Data. While the purpose 
of this chapter is to present a method for measuring the cycle 
in monthly data, it may be noted at the start that even if the 
object of analysis is to determine cycle in monthly data it is 
desirable first to study the annual data. Not only is this true 
because the monthly trend may be easily estimated from the 
annual trend, but it is also desirable because the general character 

¹ Where trends are calculated for monthly data, involving long series, 
convenient short-cut methods of calculation have been devised. Cf. Ross, 
F. A., “Formulae for Facilitating Computations in Time Series Analysis,” 
Journal of the American Statistical Association, Vol. 20 (1925), pp. 75-79; cf. 
also timesaving devices discussed in Chap. XXII. 


of the results on a monthly basis can be visualized from the 
analysis of the annual figures and the annual data can be analyzed 
with a much smaller amount of computation. The analysis 
on an annual basis will help judge the kind of analysis required 
for the monthly data, whether to use a straight-line trend or a 
second- or third-degree polynomial trend. In addition, the 
analysis on an annual basis will help to decide the significance 
of the respective trends. This will now be illustrated by making 
use of monthly and annual averages of monthly data on consumer 
installment-sale debt for household appliances in the United 
States, 1929-1942.¹ 

The work sheet is not reproduced here, but one similar to 
Table 91 (page 594) was constructed and the following set of 
subtotals was obtained: $S_1 = 2{,}842$; $S_2 = 18{,}159$; $S_3 = 89{,}614$; 
and $S_4 = 362{,}901$ (in millions of dollars). Using the method 
of orthogonal polynomials, which is the quickest method of 
finding at once the first-, second-, and third-degree polynomial 
trend lines by one set of calculations, the following results 
were obtained by the application of Eqs. (5) and (8) to (10), 
Chap. XXII (pages 605-612): 

$$
\begin{aligned}
\alpha &= \frac{2{,}842}{13} = 218.61538 \\
\beta &= \frac{18{,}159}{91} = 199.54945 \\
\gamma &= \frac{89{,}614}{455} = 196.95384 \\
\delta &= \frac{362{,}901}{1{,}820} = 199.39615
\end{aligned}
$$


The values of the denominators in the above fractions were 
obtained from Table 93 (page 613). These calculations are 
carried to more places than are significant for the problem because 
they must be combined in multiple proportions to obtain the 
following results by using Eqs. (9) and (10), Chap. XXII 
(pages 611-612): 

¹ Holthausen, Duncan McC., “Monthly Estimates of Short-term 
Consumer Debt, 1929-1942,” Survey of Current Business, Vol. 22 (Novem- 
ber, 1942), pp. 9-25, 17. 

$$
\begin{aligned}
\alpha' &= 218.61538 & A &= 218.61538 \\
\beta' &= 19.06593 & B &= 9.53296 \\
\gamma' &= 13.87471 & C &= 3.15334 \\
\delta' &= -6.12367 & D &= -0.64948
\end{aligned}
$$


Accordingly, the following set of orthogonal-polynomial 
trends is obtained; the first line is the straight-line trend; the 
first line combined with the second line is the second-degree 
polynomial trend; the first, second, and third lines combined 
give the third-degree polynomial trend: 

$$
\begin{aligned}
y' = 218.61538 &+ 9.53296\,p_1 \\
&+ 3.15334\,p_2 \\
&- 0.64948\,p_3
\end{aligned}
$$

that is to say, since $p_1 = t$, $p_2 = t^2 - 14$, and $p_3 = t^3 - 25t$ 
when $N = 13$ (see pages 605 and 615), 

$$
\begin{aligned}
y' = 218.61538 &+ 9.53296\,t \\
&+ 3.15334\,(t^2 - 14) \\
&- 0.64948\,(t^3 - 25t)
\end{aligned}
$$

The three possible trends are therefore the following (in 
millions of dollars): 

Straight-line trend: 

$$y' = 218.6 + 9.533t \quad \text{(origin at 1935)}$$

Second-degree polynomial trend: 

$$y'' = 174.5 + 9.533t + 3.153t^2 \quad \text{(origin at 1935)}$$

Third-degree polynomial trend: 

$$y''' = 174.5 + 25.77t + 3.153t^2 - 0.6495t^3 \quad \text{(origin at 1935)}$$
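As a check on the expansion, the three trends can be evaluated directly from the orthogonal-polynomial form. The following Python sketch uses the constants A, B, C, and D given above and the polynomials for N = 13; the function names are illustrative and are not part of the original work sheet.

```python
# Orthogonal-polynomial trend for N = 13 years, origin at 1935.
ORIGIN = 1935
A, B, C, D = 218.61538, 9.53296, 3.15334, -0.64948


def p1(t):
    return t


def p2(t):
    return t ** 2 - 14


def p3(t):
    return t ** 3 - 25 * t


def straight_line(t):
    return A + B * p1(t)


def second_degree(t):
    # The straight line plus the second-degree term.
    return straight_line(t) + C * p2(t)


def third_degree(t):
    # The second-degree trend plus the third-degree term.
    return second_degree(t) + D * p3(t)


# Coefficients of the expanded equations quoted in the text.
constant_term = A - 14 * C          # rounds to 174.5
linear_term_cubic = B - 25 * D      # rounds to 25.77

for year in range(1929, 1943):
    t = year - ORIGIN
    print(year, round(straight_line(t)), round(second_degree(t)),
          round(third_degree(t)))
```

Collecting the constant terms and the terms in t reproduces the 174.5 and 25.77 of the expanded equations, and the rounded yearly values can be compared with the trend columns of Table 99.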

Table 99 is presented to show the raw annual data and the 
annual values of each of these three trends, which are also 
presented graphically in Fig. 150. 

An annual increment averaging something less than 10 on a 
base of over 200 is not too great to deter the assumption that the 
straight-line trend roughly depicts rational long-term growth. 
If it is appropriate to suppose that installment-sale debt would 
tend to grow at the rate of population growth, which is pre- 
sumably geometric, the trend line fitted should be a logarithmic 

Table 99. — Annual Trend Analysis of Consumer Installment-sale 
Debt for Household Appliances in the United States, 1929-1942 
(In millions of dollars) 

Year    Raw data    Straight-line    Second-degree       Third-degree 
                    trend            polynomial trend    polynomial trend 

1929      244         161              231                 274 
1930      236         171              206                 206 
1931      200         180              187                 163 
1932      140         190              174                 143 
1933      114         200              168                 141 
1934      125         209              168                 152 
1935      151         219              174                 174 
1936      209         228              187                 203 
1937      292         238              206                 233 
1938      277         247              231                 263 
1939      259         257              263                 286 
1940      278         266              301                 301 
1941      317         276              345                 302 
1942       *          286†             396†                287† 

* Data available for only the first 9 months of the year; this year was not used in fitting 
the trends. 

† Extrapolated. 



Fig. 150. — Annual trend analysis of consumer installment-sale debt for house- 
hold appliances in the United States, 1929-1942. 









trend; but for a short period of years the straight-line trend is an 
adequate approximation of the more appropriate logarithmic 
trend. The straight-line trend in the data presently studied may 
consequently be used as the base from which the major cycle in 
the data can be measured. 

The second-degree polynomial trend shows a sharp rise for the 
later years, and if extrapolated beyond 1942 it would quickly 
approach infinity; it is not, therefore, a reasonable picture of 
rational growth. The straight line comes nearer to what would 
be the result if a logarithmic trend were fitted, if data covering 
a long enough period were available to afford sufficient per- 
spective to obtain a growth curve. 


Table 100. — Cyclical Movements in Consumer Installment-sale 
Debt for Household Appliances in the United States, 1929-1942 

Year    Raw data,     Straight-line    Third-degree     Cycle,        Cycle mixed 
        millions      trend,           polynomial       per cent      with residuals, 
        of dollars    millions         trend,           y'''/y'       per cent 
        y             of dollars       millions         (y' = 100)    y/y' 
                      y'               of dollars                     (y' = 100) 
                                       y''' 

1929      244           161              274              170           152 
1930      236           171              206              120           138 
1931      200           180              163               90           111 
1932      140           190              143               75            74 
1933      114           200              141               70            57 
1934      125           209              152               73            60 
1935      151           219              174               79            69 
1936      209           228              203               89            92 
1937      292           238              233               98           123 
1938      277           247              263              106           112 
1939      259           257              286              111           101 
1940      278           266              301              113           104 
1941      317           276              302              109           114 


The third-degree polynomial trend seems adequately to 
represent the rounded contour of a major cycle. Examination 
of Fig. 150, accordingly, leads to the conclusion that the straight- 
line trend can be used to depict growth in the data, and the 
third-degree polynomial trend can be used to measure the major 
cycle. As a consequence, the raw data divided by the straight- 
line trend should give a picture of the major cyclical movement 
in this period (plus the residual fluctuations);¹ and the third- 
degree polynomial trend divided by the straight-line trend gives 
a measure of the major cycle. Table 100 and Fig. 151 give the 
results of such computations. The column headed Cycle in 
Table 100 consists of the third-degree polynomial empirical 
trend divided by the straight-line empirical trend, giving as a 
result a smoothed measure of the major cycle. This is shown 
in Fig. 151. 

Fig. 151. — Cyclical study of consumer installment-sale debt for household 
appliances in the United States, 1929-1942. 

The column headed Cycle mixed 
with residuals in Table 100 consists of the raw data divided by 
the straight-line empirical trend, giving as a result the measure 
of the cycle mixed with residual fluctuations in annual data.² 
Both these columns are expressed as percentages, with the $y'$ 
for each year equal to 100. 
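The two percentage columns of Table 100 follow from simple division. A minimal Python sketch, using a few rows transcribed from the table and an illustrative helper name, is given here; rounding to whole per cent is assumed.

```python
def cycle_columns(raw, straight_line, smooth_trend):
    """Return (cycle, cycle mixed with residuals), both as per cent of
    the straight-line trend, rounded to whole numbers."""
    cycle = round(100 * smooth_trend / straight_line)
    mixed = round(100 * raw / straight_line)
    return cycle, mixed


# (year, raw data, straight-line trend, smooth polynomial trend),
# in millions of dollars, transcribed from Table 100.
rows = [
    (1929, 244, 161, 274),
    (1932, 140, 190, 143),
    (1939, 259, 257, 286),
]

for year, raw, straight, smooth in rows:
    print(year, *cycle_columns(raw, straight, smooth))
```

A value above 100 in either column means that the series stood above the straight-line growth trend in that year.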

Determination of Cycle in Monthly Data. Cycle Determined 
by Adjusting Monthly Data. Monthly data, when examined to 
discover the cycle, must be adjusted not only for trend but also 
for seasonal variations. Adjusting monthly data for trend 
and seasonal variation in order to measure cyclical movements 

¹ In addition, there might be minor cyclical movement, a fact that could 
be determined by further analysis of data extending over a longer period of 
time. 

² See p. 570 for meaning of “residual fluctuations” in time series. In 
this instance, the residuals might include short-cycle fluctuations. See 
also Chap. XXV, pp. 659-661. 




Table 101. — Work Sheet for Calculating Monthly Index of Cycle 
Description of Data: Consumer installment-sale debt for purchase of 
household appliances, United States 

Source of Data: Survey of Current Business, Vol. 22 (1942), pp. 9-26, 17. 

(1)           (2)           (3)           (4)            (5)               (6) 
Year and      Raw data,     Monthly       Index of       Monthly trend     Cycle, 
month         millions      trend,*       seasonal       times index of    per cent 
              of dollars    millions      variation,†    S.V., millions    y / (y' × S.V.) 
              y             of dollars    per cent       of dollars 
                            y'            Av. = 100      y' × S.V. 

1940: 
January         262           262.0          96.2           252.0            104.0 
February        255           262.8          94.0           247.0            103.2 
March           253           263.6          92.5           243.8            103.8 
April           259           264.4          95.2           251.7            102.9 
May             271           265.2          97.6           258.8            104.7 
June            281           266.0         103.2           274.5            102.4 
July            288           266.8         104.7           279.3            103.1 
August          294           267.6         107.6           287.9            102.1 
September       293           268.4         105.0           281.8            104.0 
October         290           269.1         102.2           275.0            105.4 
November        290           269.9         100.6           271.5            106.8 
December        302           270.7         101.1           273.7            110.3 

1941: 
January         290           271.5                         261.2            111.0 
February        286           272.3                         256.0            111.7 
March           286           273.1                         252.6            113.2 
April           303           273.9                         260.8            116.2 
May             320           274.7                         268.1            119.4 
June            330           275.5                         284.3            116.1 
July            336           276.3                         289.3            116.1 
August          346           277.1                         298.2            116.0 
September       342           277.9                         291.8            117.2 
October         333           278.7                         284.8            116.9 
November        320           279.5                         281.2            113.8 
December        313           280.3                         283.4            110.4 

1942: 
January         294           281.1                         270.4            108.7 
February        285           281.9                         265.0            107.5 
March           272           282.7                         261.5            104.0 
April           258           283.5                         269.9             95.6 
May             241           284.3                         277.5             86.8 
June            219           285.1                         294.2             74.4 
July            202           285.9                         299.3             67.5 
August          183           286.7                         308.5             59.3 
September       169           287.4                         301.8             56.0 
October 
November 
December 

* Equation of monthly trend: y' = 219.0 + 0.796t (origin July, 1935). 
† Necessary to copy this for only one year; this seasonal variation was calculated for 
illustrative purposes in Chap. XXIII, pp. 625-631. 

involves the removal of trend and seasonal variation from the 
raw monthly data. The method is devised to produce an index 
or relative expression of raw data compared with what would be 
expected if only trend and seasonal variation were present in the 
data. Such analysis gives an answer to the question: How 
far are the raw data from month to month above or below what 
they would be if they were following the usual course of seasonal 
variation and the expected trend? In effect, the process is the 
reverse of that illustrated by a hypothetical series at the begin- 
ning of Chap. XX. The theory underlying the method of 
adjusting monthly data for seasonal variation and trend con- 
tains no tricks new to the student after mastering the material 
in the preceding chapters. It is quite simple in concept, though 
perhaps the arithmetical calculations involved are somewhat long. 

Illustration of Method of Determining Cycle in Monthly Data. 
It should be pointed out at the start that measuring the cycle 
in monthly data is measuring the same cycle as was measured 
in the annual data, if the same trend is used. In this illustration 
the raw monthly data will be adjusted for the straight-line trend 
and for seasonal variation, so that the resulting series of monthly 
data will correspond to the figures given in the last column of 
Table 100, except that they will be monthly instead of annual 
figures. The annual averages of the monthly adjusted data 
obtained in this illustration should be equal to the annual 
percentages shown in the last column of Table 100. 

Table 101 is a work sheet drawn up for the purpose of making 
the necessary calculations, on the assumption that (1) the index 
of seasonal variation and (2) the equation of trend on an annual 
basis have been calculated. The data used are monthly figures 
for consumer installment-sale debt for household appliances in the 
United States, 1929-1942. In the illustration only the period 
1940-1942 is presented. It would be a good laboratory exercise 
for the student to work out the rest for himself and plot the 
resulting adjusted figures. 

Column (2) of Table 101 contains the raw monthly data 
tabulated from the source.¹ 

If a trend equation is in terms of annual figures, and the 
annual figures used are annual totals, the monthly trend equation 
will be 

$$y' = \frac{a}{12} + \frac{b}{144}\,t$$

But if the annual figures are annual averages of monthly data, 
then the monthly equation of trend will be 

$$y' = a + \frac{b}{12}\,t$$

¹ Holthausen, op. cit. 


Thus, b must be divided by 12 because in the monthly trend 
equation the annual increment is distributed among 12 parts 
(t now stands for months instead of years). In other words, if b 
is the annual increment, b/12 is the monthly increment. But if 
b is the annual increment of total annual data (sum of the 12 
months each year), then to put it on a monthly basis it is neces- 
sary first to convert it to a monthly figure by dividing by 12; 
it is then still an annual increment and has to be divided by 12 
again to obtain the monthly increment. 
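The two conversions just described can be written as one small function. This Python sketch is illustrative only: the function name, its flag, and the numbers in the usage lines are hypothetical, and the origin is left at the middle of the origin year, as in the annual equation.

```python
def monthly_trend_from_annual(a, b, annual_totals=False):
    """Convert the annual straight-line trend y' = a + b*t (t in years)
    into monthly coefficients (a_m, b_m) for y' = a_m + b_m*t (t in months).

    annual_totals=False: the annual figures are averages of monthly data,
        so only the increment is spread over 12 months.
    annual_totals=True: the annual figures are sums of 12 months, so both
        coefficients are first put on a monthly level (divided by 12) and
        the increment is then spread over 12 months (divided by 12 again).
    """
    if annual_totals:
        return a / 12, b / 144
    return a, b / 12


# Hypothetical series: 240 per month on the average, rising 12 per year.
print(monthly_trend_from_annual(240, 12))                          # averages
print(monthly_trend_from_annual(2880, 144, annual_totals=True))    # totals
```

Both calls describe the same hypothetical series, once as annual averages and once as annual totals, and they lead to the same monthly equation.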

In the trend equation, a can be assumed to be at June-July of 
the origin year, and of course the origin may be shifted by changing 
accordingly the value of a. The origin is at the middle of the 
year, i.e., between June and July. For example, the equation 
of trend found for the annual data on consumer installment-sale 
debt for household appliances is (in millions of dollars) 

$$y' = 218.6 + 9.533t \quad \text{(origin at 1935)}$$

The data used in this illustration are annual averages of monthly 
data; so the monthly trend equation is (in millions of dollars) 

$$y' = 218.6 + 0.796t \quad \text{(origin at June-July, 1935)}$$

in which the unit of $t$ is 1 month. 

By adding algebraically half a monthly increment to 218.6, 
the origin is shifted to July, 1935, and the approximate equation 
of monthly trend is as follows (in millions of dollars): 

$$y' = 219.0 + 0.796t \quad \text{(origin at July, 1935)}$$

Solving this equation for different values of $t$ (from $t = 54$ at 
January, 1940, to $t = 86$ at September, 1942) gives the various 
monthly values of $y'$ shown in column (3) of Table 101 under the 
caption Monthly Trend. Column (4) shows the index of sea- 
sonal variation, which in Chap. XXIII was calculated by the 
12 months’ moving total method. Column (5) shows seasonal 
variation and trend combined by multiplication, and column (6) 
is obtained by dividing the monthly items in column (2), the 
raw data, by the monthly items in column (5). By this last 
operation, both trend and seasonal variation are removed from 
the raw data; the resulting index gives an idea of how high or,low 
the raw data are in comparison to what they might be expected 
to be according to usual seasonal variation and trend. 
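The arithmetic of columns (3), (5), and (6) can be sketched in Python for the year 1940. The raw data and the seasonal index are transcribed from Table 101, and the trend coefficients are taken as printed in its footnote; the function names are illustrative, and rounding to one decimal place at each step is an assumption about how the work sheet was prepared, so an occasional entry may differ from the printed table in the last digit.

```python
# Column (2): raw data for 1940, millions of dollars.
RAW_1940 = [262, 255, 253, 259, 271, 281, 288, 294, 293, 290, 290, 302]

# Column (4): index of seasonal variation, per cent (average = 100).
SEASONAL = [96.2, 94.0, 92.5, 95.2, 97.6, 103.2,
            104.7, 107.6, 105.0, 102.2, 100.6, 101.1]


def monthly_trend(t):
    """Monthly trend, origin at July, 1935; t counted in months."""
    return 219.0 + 0.796 * t


def work_sheet(raw, seasonal, first_t):
    """Return one (trend, trend x S.V., cycle) tuple per month."""
    rows = []
    for k, (y, sv) in enumerate(zip(raw, seasonal)):
        trend = round(monthly_trend(first_t + k), 1)    # column (3)
        expected = round(trend * sv / 100, 1)           # column (5)
        cycle = round(100 * y / expected, 1)            # column (6)
        rows.append((trend, expected, cycle))
    return rows


# January, 1940, is 54 months after the origin.
rows_1940 = work_sheet(RAW_1940, SEASONAL, first_t=54)
for row in rows_1940:
    print(*row)
```

Column (6) is simply the raw figure expressed as a percentage of what trend and seasonal variation together would lead one to expect for that month.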

Data thus treated over a series of years disclose information 
about the time series that it is not possible to visualize from the 
raw figures. It makes possible the comparison between cyclical 
and minor cyclical fluctuations in time series otherwise con- 
cealed by disturbing elements of seasonal variation and trends. 
If this monthly analysis of the data on consumer installment-sale 
debt for household appliances in the United States were done 
for the entire period 1929-1942, the picture of monthly data 
would, of course, resemble the broken line of Fig. 151. The 
annual averages of the monthly data, which contain cyclical 
movements mixed with residual movements, would be equal to 
the figures shown in the final column of Table 100. In this con- 
nection it is to be noted that the annual averages of column (6) 
in Table 101 are equal to the corresponding annual figures in the 
last column of Table 100; for 1940 the annual average of the 
figures in column (6) of Table 101 is equal to 104, and for 1941 
the annual average of the figures in column (6) of Table 101 
is equal to 114. 

From the results of calculations in Table 101 it may be con- 
cluded that consumer installment-sale debt for household 
appliances reached the peak of a cycle in May, 1941, remaining 
10 to 19 per cent above normal throughout 1941. In 1942 a 
sharp decline materialized; in fact, this decline, on a monthly 
basis, was rapid after October, 1941. The raw data appear to 
indicate that the peak of the cycle occurred in August, but this is 
due to the effect of seasonal variation and trend. When sea- 
sonal variation and trend are taken into consideration, the 
cyclical peak is found to be in May, 1941. From July, 1940, to 
August, 1940, the raw data show an increase, but a cyclical 
decline occurred in that period. Removal of seasonal variation 
and trend makes it apparent that the appearance of a rise from 
July, 1940, to August, 1940, was due to seasonal influences and 
trend. 

Adjustment of data by removing seasonal variation and trend 
makes it possible to judge quickly whether or not consumer 
installment-sale debt for household appliances is rising (or falling) 
more rapidly than seasonal variation and trend would lead us to 
expect. The resulting figures are frequently described when 
published by saying that the data are “adjusted for seasonal 
variation and trend.” Sometimes, if trend is unimportant or 
of dubious character, only seasonal variation is removed and the 
data are described as “adjusted for seasonal variation.” Charts 
of such data appear frequently in financial publications and in 
the financial sections of metropolitan newspapers. 

Measuring the Cycle Where Trend Is a Second- or Third-degree 
Polynomial. The rational growth of some data is better described 
by a second-degree polynomial, as discovered in Chap. XXI. 
When such is the case, it would be necessary to use a second- 
degree polynomial instead of a straight-line trend as illustrated 
in the preceding sections of this chapter. 

A third-degree polynomial is not likely ever to resemble a 
rational growth element in a time series, but it may resemble the 
conformation of the major cycle during a specified period covered 
by data that are being analyzed. If it is desired not only to 
remove growth trend but also to remove from the data the effects 
of the major cycle in order to observe residuals that might be 
significantly described as a short cycle, the method described in 
the preceding sections could be used with monthly data, apply- 
ing the same principles to the removal of a third-degree poly- 
nomial trend combined with seasonal variation that were applied 
to the removal of a straight-line trend combined with seasonal 
variation. 

The general form of the second-degree polynomial annual 
trend is $y' = a + bt + ct^2$, where $t$ is 1 year. The equation of the 
monthly second-degree polynomial trend, where the annual 
data are annual averages of monthly data, would be 

$$y' = a + \frac{b}{12}\,t + \frac{c}{144}\,t^2$$

where $t$ is 1 month. 

The general form of the third-degree polynomial annual trend 
is $y' = a + bt + ct^2 + dt^3$, where $t$ is 1 year. The equation of 
the monthly third-degree polynomial trend, where the annual 
data are annual averages of monthly data, would be 

$$y' = a + \frac{b}{12}\,t + \frac{c}{144}\,t^2 + \frac{d}{1{,}728}\,t^3$$

in which $t$ is 1 month. 

If the data are annual totals, instead of annual averages, 
every item on the right side of the respective equations will be 
divided by 12.¹ 
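The rule for polynomial trends amounts to dividing the coefficient of each power of t by the same power of 12, with one further division by 12 when the annual figures are totals. A Python sketch under those assumptions follows; the function names are illustrative, the cubic coefficients are those of the annual trend fitted earlier in the chapter, and the shift of origin by half a month is ignored.

```python
def monthly_coefficients(annual_coeffs, annual_totals=False):
    """Convert annual polynomial coefficients [a, b, c, ...] (t in years)
    into monthly coefficients (t in months).

    The coefficient of t**k is divided by 12**k; if the annual figures are
    totals rather than averages, every coefficient is divided by a further 12.
    """
    level = 12 if annual_totals else 1
    return [coef / (12 ** k * level) for k, coef in enumerate(annual_coeffs)]


def evaluate(coeffs, t):
    """Value of the polynomial with the given coefficients at t."""
    return sum(coef * t ** k for k, coef in enumerate(coeffs))


# Third-degree annual trend (annual averages of monthly data), origin 1935.
annual = [174.5, 25.77, 3.153, -0.6495]
monthly = monthly_coefficients(annual)

# 24 months from the origin should agree with 2 years from the origin.
print(evaluate(annual, 2), evaluate(monthly, 24))
```

The final line evaluates the annual trend two years from the origin and the monthly trend 24 months from it; the two agree apart from floating-point rounding, which is what the conversion is meant to guarantee.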

Danger in Extrapolating Trends. In the illustration on page 
641 it was noted that the second-degree polynomial trend, if 
extended beyond the year 1941, would quickly go up to infinity. 
Thus the extrapolation, or extension, of this trend beyond 1942 
very soon becomes an absurdity. This shows the need for 
caution in the projection, or extrapolation, of empirical trends.² 
Their projection for short periods of time (how long depends 
upon the conditions of each particular case) is a valuable aid in 
constructing barometric indexes. 

A troublesome unsolved problem in time-series analysis is to 
know when trend is changing and also, for that matter, when 
seasonal variation may be changing. Neither the statisticians 
nor the economists have solved this problem, but they realize 
that it is ever present in time-series analysis. It is desirable, 
therefore, to be cautious about extending empirical trends into 
the future and to reexamine monthly data for seasonal variation 
at frequent intervals. A method for detecting changing trends 
in seasonal variation was explained and illustrated in the pre- 
ceding chapter. 

Method of Ratios vs. Method of Differences. In general, the 
method here presented for removing one or more types of varia- 
tion from time series has been the method of division, or ratios. 

¹ See pp. 644-645. 

² For further discussion in connection with economic forecasting, see 
Chap. XXV, pp. 661-671. 

In other words, the raw data are expressed as percentages of 
computed trend and seasonal variation. This is not the only 
method of removing trend and seasonal variation from the 
monthly or annual raw data. Another type of approach is called 
“the method of differences,” which, in one of a number of forms, 
may be summarized as follows: 

1. Assuming that the index of seasonal variation and the 
monthly trend have been calculated by one of the conventional 
procedures, the monthly trend values are multiplied by the 
index of seasonal variation. 

2. The values of trend multiplied by seasonal variation ($y'_i$) are now 
subtracted from the raw data ($y_i$). 

3. This gives a series $y_1 - y'_1,\ y_2 - y'_2,\ \ldots,\ y_n - y'_n$, being the 
arithmetical amount, in original units (pounds, dollars, etc.), 
by which the raw data are greater each month, or less, than the 
computed value for trend multiplied by the index of seasonal 
variation. These residuals of the raw data from trend and 
seasonal variation form a series $r_1, r_2, r_3, \ldots, r_n$ that would be 
a time series fluctuating arithmetically above and below zero, 
according to whether the raw data were above or below trend 
multiplied by seasonal variation, that is to say, according to 
whether the raw data were greater or less than values expected 
in view of the anticipated growth and seasonal fluctuation. 

4. These residuals are in terms of the quantity units of the 
raw data. Consequently, it would be very difficult to compare 
the residual fluctuations of a series measured in bushels (say 
wheat production) with the residuals in dollars (say the price of 
wheat). It is necessary to find a common denominator in order 
to compare the residuals in various time series, obtained by the 
arithmetical difference method. 

5. The common denominator used is the standard deviation 
of the residuals. $\sigma_r$ will be simply $\sqrt{\Sigma r^2 / N}$, since their arith- 
metical average is zero. Each $r$ divided successively by $\sigma_r$ would 
give a series in terms of standard-deviation units that could 
thereafter be compared with other time series similarly treated. 
Various series, whether the original units were dollars, pounds, 
inches, etc., will now be reduced to terms of standard-deviation 
units and can all be plotted on the same scale, namely, a scale 
that is calibrated in standard deviations. 
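The five steps above can be sketched in Python as follows. The function name and the figures in the usage lines are hypothetical; as in step 5, the standard deviation is computed on the assumption that the residuals average zero.

```python
import math


def standardized_residuals(raw, expected):
    """Method of differences.

    raw:      the raw data y_i
    expected: trend multiplied by the index of seasonal variation, y'_i
    Returns the residuals r_i = y_i - y'_i divided by sigma_r, where
    sigma_r = sqrt(sum(r_i ** 2) / N), taking the mean residual as zero.
    """
    residuals = [y - e for y, e in zip(raw, expected)]
    sigma = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [r / sigma for r in residuals]


# Hypothetical four-month series whose expected value is 10 in every month.
raw = [12, 8, 11, 9]
expected = [10, 10, 10, 10]
z = standardized_residuals(raw, expected)
print([round(v, 3) for v in z])
```

Because the output is in standard-deviation units, two series measured in different original units, such as bushels and dollars, can be plotted on one scale after this treatment.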

One important disadvantage in the method of differences 
persists even after the residuals are expressed in terms of their 
own standard deviations. The residuals will tend to be arith- 
metically greater when trend is at high values and arithmetically 
small when trend is at low values. This means that the impres- 
sion will almost universally arise from such an analysis that as 
things grow they become subject to more violent fluctuations. 
In fact, when considered relative to the magnitudes from which 
variation occurs, these greater arithmetical variations may be 
really less important. The ratio method places at all times the 
proper proportional emphasis upon arithmetical fluctuations by 
expressing them as a ratio to the trend and seasonal variation. 

On the other hand, by the same token the ratio method may 
tend to minimize the importance of fluctuations; for it may be 
possible that the proportional amount of change is not so sig- 
nificant as the actual amount of change. For example, the fact 
that the amount of unemployment is no greater proportionally 
may not necessarily dispose of the fact that the actual amount of 
unemployment at some particular time is very great and the 
corresponding personal problems distressing in the extreme. 

Whatever method of statistics is used, it is necessary for the 
analyst to keep his eyes open to the effect the method itself may 
have upon his results. 



PART VI 
Forecasting 

CHAPTER XXV 

THE ART OF FORECASTING WITH STATISTICS 
INTRODUCTION 

Prevalence of Forecasts. Ancient Origin of Pseudoscientific 
Forecasts. The human desire to look into the future led, even 
in ancient times, to the rise of various forms of pseudoscientific 
forecasts. Oracles were frequently consulted as to the outcome 
of a contemplated military campaign, business venture, or love 
affair. Among the most famous of these was the Delphian 
oracle. Astrologists were, and still are, consulted for what the 
stars have to say; one of their most prominent devotees in 
modern times is said to be Adolf Hitler. 

It was partly to disprove some of these astrological notions 
that statistical method was first undertaken on a scientific 
basis. In the seventeenth century an idea prevailed that the 
phases of the moon influenced health; also, health was supposed 
to be critical every seventh year and life particularly hazardous 
at the ages of forty-nine and sixty-three. Near the end of the 
seventeenth century studies of vital statistics by Capt. John 
Graunt of London and Caspar Neumann of Germany disproved 
the connection between health and the phases of the moon as 
well as the fateful significance of every seventh year in life. 
Other similar superstitions were “debunked” by statistical 
studies. From the beginning of the history of the modern 
money market, attempts have been made to devise some way 
to forecast the course of financial affairs. For the Antwerp 
Bourse in 1643, Christopher Kurz is said to have contrived an 
astronomical method of making prophecies about the money 
market.¹ 

¹ Ehrenberg, R., Capital and Finance in the Age of the Renaissance, 
p. 240. 

Modern Scientific Forecasts. Although forecasting was thus 
once the special prerogative of soothsayers, today it has been 
placed upon a broader basis by the development of science. 
For one of the objectives of science is, precisely, to forecast. 
Science seeks to classify and determine relationships that may be 
used for purposes of prediction. Every science is, in a cer- 
tain sense, a “forecast.” It foretells what will happen under certain 
circumstances. The law of gravitation says, for example, that 
if a ball is dropped from a tall building it will fall with an acceler- 
ation of 32 feet per second per second. Boyle’s law says that 
the pressure in a given container varies, and will vary, directly 
with the temperature and inversely with the volume. Scientific 
astronomy makes it possible to forecast the tides, to construct the 
calendar for our mundane affairs, and, in addition, to forecast 
celestial events such as the date of the next visit of Halley’s 
comet. There are no “ifs” or “buts” about the modern scientific 
forecasts in the realm of the natural or physical sciences. 

Popular Dramatization of Forecasts. The depression of the 
1930’s did more than hundreds of books could have done to make 
people cycle-conscious. So general was the interest in cyclical 
behavior that by 1940 the Foundation for the Study of Cycles 
was set up as a nonprofit organization with an international 
committee composed of scientists and businessmen. This 
foundation proposed to help in the task of integrating the work 
of the thousands of scientists and statisticians who are con- 
tributing in various fields to the study of cycles. Not only have 
cycles been found to exist in the realm of business activity, but 
scientists in many other fields believe they have discovered 
cyclical behavior in their respective studies. For example, 
psychologists have discovered that human beings have regular 
ups and downs in their emotional life, following a cyclical 
pattern. Biologists have discovered what appear to be regular 
fluctuations in animal, insect, bird, and even fish population^, 
jn 1937, Prof. William Hamilton of Cornell University, upon the 
basis of cycle studies, warned farmers and housewives of New 
York State to prepare for a scourge of mice in the winter of 
1939-1940 and for another outbreak in 1943-1944. While it 
majr still be too early to put the stamp of final scientific approval 
upon all these cyclical discoveries, they are nevertheless making 
iniportant contributions to knowledge. 

Some of the twentieth-century discoveries sound almost like
the pseudoscientific superstitions of the Middle Ages. Thus in
1943 the public was advised “to look for skunks under your
front porch about 1945.” It was claimed that an answer could
be given to such questions as: Will you feel happy or gloomy a
month from today? Such statements were made as: If you
are born in January, February, March, or April, the chances
are you will live longer than people born in July, August, or
September.¹ These notions about forecasting suggest a precision
in statistical forecasts that they probably will never possess.²

Scientific Forecasts. A forecast, to be scientific,
does not have to be unconditional; in fact, most forecasts in the
realm of the social sciences and some in the realm of the physical
sciences are hypothetical in character. Indeed, in its largest
sense, forecasting must be taken to mean prediction of not only
what will happen but what would happen under given
hypothetical conditions. Not only must the predictions of the
meteorologist and stockbroker be considered forecasts, but also
predictions of the engineer as to the outcome of certain plans
and the warnings of the economist as to the effect of certain
proposed actions of Congress are forecasts. The latter are
conditional forecasts.

Many predictions of coming events are hedged in by all sorts
of weasel-like conditions. It may be said that private enterprise
will disappear if Republicans are not elected. Or an economist
may predict that a Congressional increase in tariff rates will
cause exports to decline, provided that foreign countries do not
offset our higher tariff by giving bounties to their exporters, or
that foreign demand for American products does not increase
for some unforeseen reason, or that American exports do not
become less costly to produce. Such forecasts are conditional, or
hypothetical, in character.

The practical worth of a forecast depends, not on whether it
is conditional or unconditional, but on how much knowledge
the forecaster actually has of the relevant conditions. An
unconditional forecast may be merely a wild guess and have little

¹ Dewey, Edward Russell, “Science Predicts the Future,” American
Magazine, Vol. 136 (1943), pp. 90-92.

² See More Exact Forecasting and Less Exact Forecasting, pp. 659 ff.
See also Smith and Duncan, Sampling Statistics and Applications.
value, in spite of its uncompromising and categorical appearance,
while a carefully drawn conditional forecast may be of great value,
in spite of its “pussyfooting” aspect. In the case of the latter, it
may be that the likelihood of the conditional factors is very
slight and that they are mentioned only to guard the forecaster
from unwarranted criticism. On the other hand, if the
disturbing factor has a fair likelihood of occurrence, the nature of its
effect might be forecast so that the recipient of the forecast
could be on his guard against this factor; by watching it, he
might know when to abandon his faith in the original forecast.
For example, if a prediction of rain tomorrow is based merely
on the fact that it looks somewhat cloudy today, the forecast
would probably be of little value (in the sense that such forecasts
would probably be wrong more often than they were right).
In contrast, if a trained observer predicted rain after a thorough
observation of the weather situation, this would have
considerable value even if he hedged his prediction by saying that the
rain might not occur if the wind in a neighboring area shifted
before a certain time.

Qualitative Forecasts. Most forecasts are qualitative in
character. The meteorologist says it will rain
but does not always say how heavily. The economist may
predict that the effect of an increase in the tariff will be to raise
prices, but he does not often say to what degree. The
meteorologist, on the contrary, may give the approximate time when rain
is expected and how many inches are expected to fall; and the
economist may try to estimate the average foreseen rise in prices.
The latter would be quantitative forecasts.

It will be noted that forecasts may be quantitative in two
ways, with reference to the degree of the predicted change and
with reference to the time of occurrence. The success of
forecasting must be judged, not only on the basis of whether the
forecast was correct, but also on how far the forecast went in
actually describing the future event: its quantity and its timing.

Illustrations of Modern Forecasts. Modern forecasts are
applied in many fields. Predictions of astronomical
events, as already indicated, have been among the earliest and
most successful forecasts. The movements of the moon, the
planets, and other heavenly bodies have been computed with
considerable accuracy so that their future course may be
predicted with great precision. Forecasts of certain eclipses,
for example, have been only a few seconds in error in timing.
In this connection, it is interesting to note that the theory of
least squares was largely developed in the attempt to forecast
the paths of the heavenly bodies.

Closely akin to astronomical forecasts have been forecasts
of weather conditions. Short-range forecasts are based mainly
on wind conditions and barometric pressures, but long-range
forecasts are sometimes attempted from the study of rainfall
data, sunspots, and the like. In some instances, studies of
growth rings in old trees have yielded weather data going back
many years. These studies usually look for cyclical fluctuations
that will indicate periods of high and low activity and permit
long-range forecasting. Studies of average weather conditions
and the dispersion around these averages also afford forecasts
of the variability of conditions in different areas and hence
suggest the more desirable airports and air routes.

Engineers make many forecasts. A water-power engineer
will forecast the amount of power to be obtained from a dam of
given size built in a given river. Another engineer may predict
the breaking strength of given kinds of wire at different
temperatures. Still another may predict the maximum load to be
sustained by a given bridge.

From the laws of Mendel, biologists make forecasts of
results to be expected from crossbreeding. Agronomists will
predict the average results to be obtained from the use of certain
fertilizers, or certain methods of cultivation, or certain varieties
of crops. Agricultural economists attempt to predict the
effect of certain sized crops on the future prices of important
commodities or the effect of certain prices on the future volume
of production.

Business economists attempt many kinds of forecasts. From
studies of factors closely related to the sale of a certain product
in a given area where the trade has been well established,
forecasts may be made of the sales to be obtained from new untapped
areas of similar character. Other economic forecasts aim to
predict the ups and downs of the business cycle in various lines
of activity. Probably the greatest percentage of economic
forecasts are devoted to predictions of the stock market: money
rates, bond prices, and security prices.

These are but a few of the examples of forecasting. It is
probably true that forecasting in all its ramifications is pandemic
in modern life.

Use of Statistics in Forecasting. This chapter attempts to 
outline the use that may be made of statistical analysis in making 
forecasts. Details of the methods of forecasting are beyond the 
scope of this volume, which is not a book on forecasting but 
merely includes a chapter on the pattern of methods used in fore- 
casting. A few examples will be given to illustrate these 
methods. The aim here is primarily to indicate the application 
of different statistical techniques to the problems of forecasting. 

Statistics affords a basis for forecasting in two principal
ways. (1) By studying monovariate and multivariate frequency
distributions, statistics are used to forecast average results and
the type and degree of dispersion around these results. (2) By
means of time-series analysis, statistics are used to predict the
course of events in time. Each of these applications to
forecasting will be discussed in the ensuing pages.

FORECASTS FROM DISTRIBUTION STUDIES

Forecasts from Monovariate Distributions. If considerable
data have been obtained, forecasts from monovariate
distributions may yield good estimates of the mean, standard deviation,
coefficient of skewness, and kurtosis of the population from
which the data were derived. If such is the case, these
population estimates may be used to forecast the character of future
samples drawn from the given population.

Familiar matters relating to family care and health
conventionally rely upon forecasting by use of monovariates. Suppose
the frequency distribution, or monovariate, represents the
weights of boys of specified age. The mean of that distribution
is presumably normal weight for that age; the standard
deviation and kurtosis describe expected variability. From such a
monovariate and its statistics, it is commonly inferred whether or
not a child is under normal weight and, if so, whether or not this
deficiency is sufficient to cause alarm. Taken with other
evidence it may be the basis for the application of timely
therapeutic action.

In social control, monovariate distributions are used to
standardize products involving the presumption of forecasting.
The fat content and standard deviation in fat content of milk
that has been produced and sold in the past constitute a set of
standards according to which it is ruled that milk sold in the
future will conform; thus milk is graded according to standards
found by frequency-distribution analysis. Methods of weight
or content are used by the Bureau of Standards to set standards
for many products, both in the raw state, such as grains of wheat,
and in the final product, such as bread; and abnormal variations
from these standards in the market are not permitted.

In business the use of monovariate distributions for
forecasting is widespread. For example, the distribution in sizes
of shoes sold by a retailer is used as the basis for forecasting his
future sales and for determining his reorders of additional shoes.
In such forecasting, the businessman is interested in forecasting
each class in the distribution rather than in the distribution's
average and standard deviation. A similar forecasting
procedure is used by any retailer when he purchases articles that
are sold by size, which include most articles of clothing. The
wholesaler and the manufacturer also are interested in the same
type of forecasting, so that they may profit by having the
appropriate number of articles of various sizes continually
ready for the consumer; if the articles are there, ready for him
to buy when he comes, a minimum of consumers' sales will be
lost.¹
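
The class-by-class forecast described for the shoe retailer can be
sketched in a few lines. This is only an illustration: the past sales
figures, the planned total of 200 pairs, and the function name
forecast_by_class are hypothetical, and the sketch assumes that future
sales will fall into the size classes in the same proportions as past
sales.

```python
from collections import Counter


def forecast_by_class(past_sales, planned_total):
    """Split a planned total across size classes in proportion to past sales."""
    counts = Counter(past_sales)
    n = len(past_sales)
    return {size: planned_total * count / n
            for size, count in sorted(counts.items())}


# Hypothetical record of 100 past sales, one entry per pair sold.
past_sizes = [7] * 10 + [8] * 25 + [9] * 40 + [10] * 20 + [11] * 5

# Forecast of how a reorder of 200 pairs should be divided by size.
reorder = forecast_by_class(past_sizes, 200)
for size, pairs in reorder.items():
    print(f"size {size}: {pairs:.0f} pairs")
```

The forecast here is for each class of the distribution, not for its
mean or standard deviation, which is the point made in the text.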

Forecasts from Bivariate Distributions. Bivariate data may 
likewise yield estimates of a bivariate population that may make 
it possible to forecast results of future samples. Suppose, for 
example, that it is found from a study of army records that there 
is a high correlation between the Army General Classification
Test scores and the results obtained in a given electrical course.
To be specific, suppose that the bivariate distribution of these 
two variables appears to be normal in form and it is estimated 

¹ For another illustration of the use of monovariates in forecasting, see
Robert J. Myers, “Comparison of Demographic Rates Assumed by the
National Resources Committee with Actual Experience,” Journal of the
American Statistical Association, Vol. 38 (1943), pp. 201-209; also, for an
example of such forecasting for the purpose of control of quality of
manufactured product, see William B. Rice, “Quality Control Applies to
Business Administration,” Journal of the American Statistical Association,
Vol. 38 (1943), pp. 228-232; cf. W. A. Shewhart, Economic Control of
Quality of Manufactured Product (1931).
that the line of regression of electrical grades (E) on Army General
Classification Test scores (T) is

E = 5 + 0.8T

with a first-order standard deviation of seven points. From
this it is possible to make certain predictions regarding the future
results of men taking the electrical course. Thus, it might be
predicted that men who got 90 on the Army General
Classification Test will get an average grade of 77 on the electrical
course, half of them getting less than this and half getting more.
If 70 is taken as the passing grade, it may be predicted that
something like 84 per cent of the men whose score is 90 will
pass the course, 70 being one standard deviation less than the
average and the distribution being normal.¹
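
The arithmetic of this example can be checked with a short sketch. It
assumes the regression line has an intercept of 5 and a slope of 0.8
(the values that reproduce the average grade of 77 quoted above), a
standard deviation of 7 about the line, and normally distributed
grades. The function names are illustrative only.

```python
from statistics import NormalDist

INTERCEPT = 5.0      # assumed constant term of the regression line
SLOPE = 0.8          # assumed slope; 5 + 0.8 * 90 = 77, as in the text
RESIDUAL_SD = 7.0    # standard deviation about the line of regression


def predicted_grade(test_score):
    """Average electrical-course grade forecast for a given test score."""
    return INTERCEPT + SLOPE * test_score


def pass_probability(test_score, passing_grade=70.0):
    """Share of men with this test score expected to reach the passing grade."""
    grades = NormalDist(mu=predicted_grade(test_score), sigma=RESIDUAL_SD)
    return 1.0 - grades.cdf(passing_grade)


mean_grade = predicted_grade(90)
share_passing = pass_probability(90)
print(f"average grade: {mean_grade:.0f}")
print(f"share passing: {share_passing:.0%}")
```

Because 70 lies exactly one standard deviation below 77, the share
passing is the area of the normal curve above minus one standard
deviation, or about 84 per cent.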

Forecasts from Multivariate Distributions. Study of
multivariate distributions may permit the same sort of forecasting as
the study of bivariates. Suppose that a study of army salvage
data shows that the length of service of wool socks is closely
related to the amount of marching required of the troops and the
average temperature of the area. Suppose, for example, that
the plane of regression derived from the data is as follows:²

Length of life = 40 days - 3 (average miles marched per day)
                 + 2 (average temperature)

Then, if the average miles marched is increased by 2 miles per
day, the Army may predict that the average length of life of
wool socks will probably decrease by 6 days. Or again, if the
troops are shipped to an area in which the average temperature
is 10 degrees warmer, the average length of life of the wool socks
will be increased by 20 days. Or, in a third instance, if it is
planned to send a force to a given area for which the average
temperature is approximately 60 degrees and it is expected that
the troops will march about 20 miles per day on the average,

¹ For illustrations of the use of bivariates for prediction in business, see
Patricia Daly and Paul H. Douglas, “The Production Function for
Canadian Manufactures,” Journal of the American Statistical Association,
Vol. 38 (1943), pp. 178-186; also see pp. 674-676.

² The correlation between length of life and temperature is assumed to
be positive on the ground that, the warmer the climate, the less the use
that would be made of wool and the more the use that would be made of
cotton socks. See the discussion on planes of regression, pp. 448-455.
then it should have sufficient supplies of wool socks on hand to
provide for new issues every 100 days on the average.¹ If the
study indicated that the first-order standard deviation about the
plane of regression was about 5 days, the Army might keep on
hand a large enough supply to replace socks every 85 days (that
is to say, 100 minus three times the standard deviation, or
100 - 15 = 85).
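
The forecasts drawn from the plane of regression can be reproduced
with a brief sketch. The coefficients (40, minus 3, plus 2) and the
standard deviation of 5 days are those given in the example; the
function name sock_life_days is illustrative only.

```python
def sock_life_days(miles_per_day, avg_temperature):
    """Average length of life of wool socks from the plane of regression."""
    return 40.0 - 3.0 * miles_per_day + 2.0 * avg_temperature


RESIDUAL_SD = 5.0  # standard deviation about the plane, in days

# Force marching 20 miles per day in an area averaging 60 degrees.
expected_life = sock_life_days(20, 60)

# Conservative replacement interval: three standard deviations below average.
safe_interval = expected_life - 3 * RESIDUAL_SD

# Effect of changing one factor while the other is held constant.
effect_of_two_more_miles = sock_life_days(22, 60) - expected_life
effect_of_ten_more_degrees = sock_life_days(20, 70) - expected_life

print(expected_life, safe_interval)
print(effect_of_two_more_miles, effect_of_ten_more_degrees)
```

Each coefficient gives the change in length of life per unit change in
its own factor, so 2 more miles per day costs 3 x 2 = 6 days and 10 more
degrees adds 2 x 10 = 20 days.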

Errors of Forecasts. In concluding this section on the use
of monovariates, bivariates, and multivariates as forecasters, it
must be noted that forecasts of the kind indicated are necessarily
inexact. They are based on the assumption that the population
is exactly known. When the population characteristics
themselves have to be estimated, as they usually do, then the
forecasts based upon the estimates will suffer from all the errors
involved in the latter. The more refined analysis that is required
to take care of these errors of estimate of the population is
beyond the scope of this book. It is sufficient here to point out
that the error of forecast based upon estimated population
characteristics is greater than that based upon a known population.
For example, if a plane of regression based upon sample estimates
has a related standard deviation of five points, the probability of
a forecast based on the plane being off by as much as two times
five points in either direction (that is, 2σ) will be, not the
normal 5 per cent, but perhaps 10 per cent or more.
Everything depends on the size of the sample from which the original
estimates of the population characteristics were made.
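
A small simulation may make this point concrete. It is a simplified
sketch, not the refined analysis referred to above: instead of a plane
of regression it forecasts a single normal variable from its sample
mean, and the sample sizes (5 and 100), the number of trials, and the
random seed are arbitrary assumptions. It counts how often a future
value misses the forecast by more than twice the estimated standard
deviation.

```python
import math
import random


def mean_and_sd(values):
    """Sample mean and standard deviation (n - 1 divisor)."""
    n = len(values)
    m = sum(values) / n
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))


def miss_rate(sample_size, trials=5000, seed=1):
    """Share of forecasts off by more than 2 estimated standard deviations."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        # Population is normal with true standard deviation of five points.
        sample = [rng.gauss(0.0, 5.0) for _ in range(sample_size)]
        center, spread = mean_and_sd(sample)
        future = rng.gauss(0.0, 5.0)
        if abs(future - center) > 2 * spread:
            misses += 1
    return misses / trials


small_sample_rate = miss_rate(5)
large_sample_rate = miss_rate(100)
print(f"estimates from   5 observations: {small_sample_rate:.1%} of forecasts miss")
print(f"estimates from 100 observations: {large_sample_rate:.1%} of forecasts miss")
```

Under these assumptions the miss rate for the small sample comes out
well above the normal 5 per cent, while the rate for the large sample
stays close to it.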

FORECASTING TRENDS WITH TIME SERIES 

More Exact Forecasting. If much is known about a
particular time series, so that the nature of its growth and cyclical
movements can be fairly well determined from rational
considerations and if the remaining fluctuations are apparently
random, forecasting from such a series can be put on much the
same level as forecasting from distributions of the monovariate,
bivariate, and multivariate type discussed above. Careful
estimates may be made of the growth, and these may be
extrapolated for a short period of time into the future. Distribution
analysis of the random fluctuations will determine the range of

¹ For the summer season this would be less than for the winter season,
but a stock level based on a 100-day turnover might be taken as normal.
fluctuations around the growth curve and will afford an estimate
for the error of a forecast based solely on the growth element in
the series.

Suppose, for example, that a logistic form of growth appears to 
be very logical for a certain type of data. If the data have 
reached a certain stage of development, the values of the next 
few periods may be forecast from an extrapolation of the logistic 
curve fitted to the past data. The amount of error in the fore- 
cast resulting from the random fluctuations around the normal 
growth may be estimated from the standard deviation of the 
residual fluctuations of the data from the fitted logistic curve. 
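
The procedure just described can be sketched as follows. This is a
minimal illustration under stated assumptions: the ten observations
are hypothetical, the upper limit of growth (100) is assumed to be
known in advance, and the curve y = K / (1 + exp(a + b t)) is fitted by
ordinary least squares after the transformation ln(K/y - 1) = a + b t,
which is only one of several ways of fitting a logistic.

```python
import math


def fit_logistic(times, values, ceiling):
    """Least-squares fit of ln(ceiling / y - 1) = a + b * t; returns (a, b)."""
    z = [math.log(ceiling / y - 1.0) for y in values]
    n = len(times)
    t_bar = sum(times) / n
    z_bar = sum(z) / n
    b = (sum((t - t_bar) * (zi - z_bar) for t, zi in zip(times, z))
         / sum((t - t_bar) ** 2 for t in times))
    return z_bar - b * t_bar, b


def logistic(t, ceiling, a, b):
    """Value of the fitted growth curve at time t."""
    return ceiling / (1.0 + math.exp(a + b * t))


def residual_sd(times, values, ceiling, a, b):
    """Standard deviation of the data about the fitted curve (n - 2 divisor)."""
    resid = [y - logistic(t, ceiling, a, b) for t, y in zip(times, values)]
    return math.sqrt(sum(r * r for r in resid) / (len(resid) - 2))


# Hypothetical series for periods 0 through 9, with an assumed ceiling of 100.
periods = list(range(10))
observed = [4.9, 7.3, 12.2, 18.0, 27.3, 37.2, 50.6, 61.8, 73.5, 81.4]
CEILING = 100.0

a, b = fit_logistic(periods, observed, CEILING)
forecast_next = logistic(10, CEILING, a, b)      # extrapolation one period ahead
error_sd = residual_sd(periods, observed, CEILING, a, b)
print(f"forecast for period 10: {forecast_next:.1f}")
print(f"standard deviation of residuals: {error_sd:.2f}")
```

The residual standard deviation measures only the random fluctuations
about the curve; it makes no allowance for error in the fitted curve
itself or in the assumed ceiling.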

Illustrations of the type of time series that would permit
fairly exact forecasting are afforded by Fig. 54 in Chap. V and
Figs. 144 and 145 in Chap. XXI.

Less Exact Forecasting. The real difficulty in most
time-series analyses is to determine what is random and what is not
random. Furthermore, it is often hard to work out any rational
basis for specific forms of the trends and cycles.¹ In cases
where there is no particular trend indicated by the rationale of
the situation, forecasts must be of a rough-and-ready sort, and
little can be done to determine the error of forecast.

Economic time series are generally of the sort that do not
permit more exact statistical forecasts.² For this reason
statistical analysis is usually only one of the elements entering into
the making of economic forecasts. In some cases it plays a more
important role than others, but nearly always the forecaster
incorporates his statistical findings into a general appraisal of
the situation. As indicated above, statistical analysis in these
cases is itself largely intelligent guessing. The statistical part
of an economic forecast is consequently merely the quantitative
ingredient of the final forecast.

Public utilities, especially the telephone companies, are
keenly interested in the subject of forecasting growth or trend
elements in time series. In the telephone business the laying

¹ See discussion on rational vs. empirical trends, pp. 550-565.

² Tintner, Gerhard, “The Analysis of Economic Time Series,” Journal
of the American Statistical Association, Vol. 35 (1940), pp. 93-101. Wallis,
W. Allen, and Geoffrey H. Moore, “A Significance Test for Time Series
Analysis,” Journal of the American Statistical Association, Vol. 36 (1941),
pp. 401-409; Vol. 38 (1943), pp. 153-164.
out of plans and the construction of new exchanges necessitate
some sort of forecast as to the future growth of the community.
For years these companies have maintained elaborate and
efficient research organizations whose business it is to forecast
trends in growth of population as well as the geographical
distribution of various types of business and residential areas
in the communities served.

Most business enterprises, however, are more concerned 
about cyclical fluctuations than about trend or growth in time 
series. For this reason the greatest number of published fore- 
casts have to do with the prediction of cyclical movements in 
business conditions. 

FORECASTING CYCLES WITH TIME SERIES 

All that has been said about the inexactness of forecasting
trends by the use of time series applies equally to the forecasting
of cycles with time series. Nevertheless, the practice of relying
upon statistics as an aid to business is now so prevalent that
statistics, along with accounting, has become one of the standard
tools and one of the essential means of internal control of nearly
all economic enterprises, as well as a guide to public policies of
governmental agencies. Among its many commercial uses,
business forecasting is one of the most important, and it is along
this line that statistical analysis has been intensively developed
in recent years. Today there are several important agencies
that supply forecasting services. Among these are Standard &
Poor's Corporation, Brookmire Economic Service, Moody's
Investors Service, Babson, and the Harvard Economic Society.
In addition, many commercial banks such as National City,
Cleveland Trust, and Chase National include forecasts of
probable future business trends in their monthly letters.
Supplementing these professional services are the statisticians and
statistical departments of many large corporations, such as the
American Telephone and Telegraph Company, which make
forecasts for their own use.

American activity in this field has been internationally
contagious. As early as 1921 the publication of the Economic
Bulletin of the Conjuncture Institute was begun in Moscow; this
publication was devoted to the study of business cycles and
to the analysis of Russian statistics. Subsequently in nearly
every important European country during the 1920's and
1930's forecasting services were organized, sometimes by the
large universities. The League of Nations showed its intention
of inaugurating forecasting on a world scale by appointing a
Committee of Experts on Economic Barometers.¹

The many possible occasions when forecasting is required in
modern business can be shown by a few examples. A
commercial banker granting a loan must forecast the probability
of its being repaid; his judgment in this respect will depend on
his forecast of the borrower's future earning power; this, in turn,
depends on his estimation of probable future price stability
in the borrower's business. Similarly, a collateral loan will
involve a prediction, more or less precise, of the future value
of the security offered as collateral. A manufacturer needs to
forecast probable sales and probable prices of his own goods
and of materials he has to purchase, so that he can profitably
plan production and plant expansion. A public-utility operator
needs to forecast population and industrial trends, construction
and operating costs, and probable prices for the service, in
order to determine when and where to build a railroad line, a gas
main, a power plant, or a telephone exchange.

All these things are commonplace in economic life, but the
growing complexity and interdependence of economic society
have made it increasingly difficult for the average businessman
to comprehend an existing situation in trying to formulate his
programs for the future. He is not a statistical expert. His
knowledge of methods of summarization and comparison goes
usually little beyond a vague comprehension of averages. To
aid him, it is the purpose of the various business forecasting
agencies “to provide the basis for business, financial, and security
market policy. Regardless of the inevitable margin of error
in every forecast, business, financial, or security market policy
which is geared to only a fairly intelligent estimate of future
probabilities is more likely to be sound than is policy geared
only to guess, or to no forecast whatsoever.”²

¹ Cox, G. V., An Appraisal of American Business Forecasts, pp. 1-2.

² “Forecaster’s View of Forecasting,” Standard Statistics (June 16,
1931), p. 14. Also see Bratt, Elmer C., Business Cycles and Forecasting
(1941), pp. 736-800; Hardy, Charles O., and Garfield V. Cox,
Forecasting Business Conditions (1927).

Forecasting General Business Conditions. One of the most
important objects of economic forecasting is to predict general
business conditions, that is to say, the cyclical position of general
business. Good times and bad times are such important
elements in determining the prosperity of individual lines of
activity and of individual firms that the prospect for general
business is probably the first thing any corporation executive
wishes to know. Statistically, general business is properly
measured by some index of all business activity. It is the
summation of the whole and not merely one of the parts, although
an index of a part, say an index of industrial production, may be
taken as a barometer of the upswings and downswings of the
whole. Such series are commonly called “business barometers.”¹
Forecasts of general business conditions are based upon one
of two forecasting methods or a combination of the two. The
first method is known as “historical analogy,” the second as
“crosscut analysis.”²

The method of historical analogy is based on the assumption
that in cyclical fluctuations history tends to repeat itself. In
its cruder forms, this consists merely in forecasting the course of
general business, subsequent to some disturbance, from the
course of general business that followed a similar disturbance
in the past. For example, the forecaster might undertake to
predict the course of general business following the crisis of 1929
from the course of business following the crisis of 1873. In
more carefully thought-out form, historical analogy becomes a
business-cycle theory that attempts to explain how the interplay
of economic forces causes general business now to rise and now
to fall.

Crosscut analysis proceeds on the basis that the business
situation is never the same and that each new upswing or
downswing is the product of a set of factors different from those
previously operative. To understand the given situation all
the factors must be weighed as to their importance and a net
appraisal of the situation derived.

¹ See pp. 530-535.

² For more elaborate classifications see Bratt, op. cit., pp. 736-760;
Haney, L. H., Business Forecasting (1931), p. 195; Day, E. E., “The Rôle of
Statistics in Business Forecasting,” Journal of the American Statistical
Association, Vol. 33 (1938), p. 2.

In good forecasting, both methods are employed. If a certain
cyclical theory appears to constitute a good explanation of past
events, it is good forecasting practice to consider it in predicting
future cycle changes. Nevertheless, careful study must be
made to see whether the role played in the past by a particular
industry or economic development is subsequently being played
by some other industry or development. The user of historical
analogy must always, therefore, be on guard for changes required
in the statistical embodiment of the cyclical theory on which his
analysis is based in order to keep it up to date in its assumptions.
During the railroad era, statistics about railroads dominated
the scene as good indicators of general business conditions;
later, it was statistics about automobile production; perhaps
the time will come when it will be aircraft production. Again,
the present era is often regarded as the “iron age.” Statistics
of iron and steel production are often used as barometers of
general business conditions because so many of the products
of the modern age depend upon iron. Perhaps the time will
come when the emphasis will shift, from the point of view of
statistics, away from iron and steel production to the production
of the lighter metals such as aluminum. Who can say when the
world of business is changing from the one to the other?

Reflection along the lines indicated in the preceding paragraph
leads to the conclusion that continuous crosscut analysis is
needed as a means of verifying and justifying the use of the
historical-analogy method.

Forecasting by Historical Analogy. One type of forecasting by
historical analogy makes extensive use of the fluctuations in
particular time series that appear to lead general business
conditions. Examples of series that have been used as business
barometers are indexes of stock-market prices, changes in unfilled
orders of the United States Steel Corporation, machine-tool
orders, and the loan-deposit ratio. These series, it is argued, will
tend to lead changes in general business conditions, and important
changes in general business conditions will first be made apparent
by them. For example, a clear and consistent downswing in
unfilled steel orders is presumed to presage a similar downswing
in general business. Hence the latter is presumably forecasted
from the former. In the case of the loan-deposit ratio, it is the
attainment of certain critical levels that is significant; when high
levels are reached, for example (i.e., when loans are high relative
to deposits), strained credit conditions are in evidence and a
crisis will be forecasted.

More elaborate analyses making use of the historical analogy
for forecasting combine several economic series. A well-known
example of such a combination is that prepared by the Harvard
Economic Society and published in the Review of Economic
Statistics. While the society itself makes no forecasts from its
statistical series, they have been found useful for such purposes
and it is generally understood that that is what they are published
for. These are shown in Fig. 162.

The Harvard series consist of three curves, known as the
A, B, and C curves. The A curve represents speculation, the
B curve business, and the C curve money. The actual data
upon which these curves have been based vary from time to
time. In those shown in Fig. 152, the curves are constituted as
follows: The A curve, speculation, is based on the price of all
securities listed on the New York Stock Exchange. The B
curve, business, is based on bank debits in 241 cities outside
of New York City. The C curve, money, is based on short-term
money rates. In each of the constituent series the trend and
seasonal variation were removed (when it was deemed
appropriate) before the final indexes were computed.

The theory that underlies the use of the Harvard curves
for forecasting is that changes in speculation will generally
precede changes in general business and that the significance
of these changes will be more clearly understood when the
course of the money curve is noted. A sharp rise in speculation
at a time when money rates are low and still falling would appear
to forecast better business conditions. On the other hand, a
fall in speculation at a time when money rates are rising would
appear to forecast a decline in general business. If coupled with
a detailed crosscut analysis of the current business situation,
these curves are found very helpful in predicting general business
conditions.

The Harvard curves are but one set of curves that have been 
employed in this attempt to forecast general business conditions. 
Various combinations of curves have been used. A number
make use of capital issue by private corporations and capital
expenditures of the various government bodies. The idea
behind the use of investment curves is that the volume of income,
and hence business, is largely determined by the volume of
investment.

Frickey, Edwin, "Revision of the Index of General Business
Conditions," Review of Economic Statistics, Vol. 14 (1932), pp. 80-87.

As the result of a great amount of research work during the
past twenty or twenty-five years, mostly under the auspices
of the National Bureau of Economic Research or the National
Industrial Conference Board, but also by scholars in the United
States Department of Commerce, increasing attention has been
given to methods of measuring business conditions based upon
the quantity and distribution of national income. Instead of
indexes of production, employment, volume of trade, and the like,
these new indexes attempt to measure national income and its
distribution, consumer expenditures and producer expenditures,
savings, capital formation, and the like. Figure 153 gives a
picture of annual consumer spending, 1919-1942, showing indexes
constructed by Kuznets (National Bureau of Economic Research)
and by the United States Department of Commerce. Figure
154 is another illustration of the use of national-income statistics
and their derivatives to show the cycle in general business
conditions. This figure reproduces an index of that part of the
national income devoted to expenditures for new durable goods
and indexes of gross capital formation, net capital formation, and
offsets to savings. The United States Department of Commerce
index of private gross capital expenditures is presumably
equivalent to Kuznets's gross capital formation; to these are added
indexes by Lauchlin Currie reputed to measure income-producing
Federal expenditures that offset savings and net government
contribution to savings. The index of expenditures for new
durable goods is constructed by the Board of Governors of the
Federal Reserve System.

Time and experience will reveal whether or not the
national-income type of indexes proves to be better than the
barometer or over-all measure of business activity types. The
national-income type has been made possible by the increasing
amount of statistical data on income resulting from the
administration of the Federal personal and corporate income taxes.

Hoffenberg, Marvin, and Mabel S. Lewis, "Estimates of National
Output, Distributed Income, Consumer Spending, Saving, and Capital
Formation," Review of Economic Statistics, Vol. 25 (May, 1943),
pp. 107-174, 124.

Fig. 153. — Consumer Spending, 1919-1942. [From The Review of
Economic Statistics, Vol. 25 (May, 1943), p. 124.]

Perhaps the greatest difficulty with all forecasting series is
that the amount of the lead or lag is likely to vary considerably
from time to time, so that the timing of the forecasted change
becomes difficult. Another difficulty is to judge how great a
change in the forecasting series must be before it is considered
significant. The curve is almost bound to show minor ups and
downs that are little related to general business. Presumably a
movement either up or down must be great and persistent before
any significant change is forecasted, but how great and how
persistent is the question. The answer to this question is always
easy to read ex post facto, but in following the forecaster from
month to month this is more difficult. If a lead is short and
data are not reported quickly, a given forecasting series,
consistent and reliable as it may be, is unlikely to have much
forecasting value since the change would be under way before it
was manifested by the forecasting data.

These difficulties apply in differing degrees to the various
kinds of forecasters. In the case of the barometer type, which
is ordinarily dependent upon one presumably indicating series
such as unfilled orders of the United States Steel Corporation,
the data are usually promptly available; but the minor ups and
downs and the varying degree of lead and lag in the barometer as
compared with general business constitute ever-present
difficulties in its use. The indexes of general business activity
based upon combinations of several series are less affected by
difficulties with respect to lead and lag and minor fluctuations,
but it is often difficult to find a combination of series that are
promptly reported. The national-income type of indexes suffers
particularly from the fact that the data are not available
currently, except for estimates that are being attempted, and these
are dependent upon other types of data.

See p. 674 for application of leads and lags to the forecasting of
specific lines of business activity, in which it can be more successfully

A unique type of forecasting by historical analogy is employed
by Roger Babson. The forecasting instrument is the Babson
index of the physical volume of production. This covers
manufacturing production, mineral production, agricultural
marketings, building and construction, railway freight, electric
power, and foreign trade. The long-run trend of this curve is taken
as normal, and the cyclical fluctuations are forecasted on the
mechanical principle that a given action has an equal and
opposite reaction. Thus the area of a given period of prosperity
will indicate the area, but not the shape, of the coming depression.
The shape of the depression area is forecasted to some extent
with the help of other series and contemporary crosscut analysis
of individual lines of activity.
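
The equal-area principle can be illustrated with a small
calculation. The sketch below shows only the arithmetic and is not
Babson's actual procedure: it assumes the index is expressed as a
percentage of the long-run trend (100 = normal), measures area as
the sum of monthly deviations, and uses hypothetical index values
and hypothetical function names.

```python
def area_above_normal(index, normal=100.0):
    """Sum the monthly excesses of the index over the normal (trend) line."""
    return sum(max(value - normal, 0.0) for value in index)


def area_below_normal(index, normal=100.0):
    """Sum the monthly shortfalls of the index below the normal line."""
    return sum(max(normal - value, 0.0) for value in index)


# Hypothetical index values (per cent of trend) for a period of prosperity.
prosperity = [102, 105, 109, 112, 110, 106, 103]

# Under the equal-area assumption, the coming depression is expected to
# accumulate the same area below the normal line.
expected_depression_area = area_above_normal(prosperity)

# Hypothetical depression months observed so far; the difference is the
# area still to be expected.
depression_so_far = [98, 95, 93]
remaining_area = expected_depression_area - area_below_normal(depression_so_far)

print(expected_depression_area, remaining_area)
```

The area fixes only the total size of the expected reaction. How
that total is spread over time must come from other evidence, as
the text notes.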

Forecasting by Crosscut Analysis. Even if considerable
reliance is placed upon certain forecasting series based upon the
historical-analogy principle, it would seem desirable to
supplement the analysis by a more detailed study of the current
situation. This will help to time the forecast better. It will
also assure the forecaster that the forecasting series continue to
hold their theoretical significance in the ebb and flow of business.
The great danger is that the business-cycle theory on which the
forecasting series are based may become outmoded or may be too
simple to be fully satisfactory as new conditions unfold.
Crosscut analysis may possibly reveal these defects and help to
remedy them.

Some believe that business cycles are unique and that the
roles played by various economic developments shift from cycle
to cycle. If this were true, crosscut analysis would be the only
logical method of forecasting. Some general theory would
necessarily have to underlie the forecast, even if it were the
negative theory that all cycles are unique. Nevertheless, it is
necessary to examine all the important sectors of the economy, to
weigh their relative importance in the given situation, and to
determine the net outcome. This requires comprehensive surveys
and shrewd judgment based on wide experience.

Such agencies as the Brookmire Economic Service, Standard &
Poor's Corporation, and Moody's Investors Service generally
follow the crosscut method. The Brookmire Economic Service
watches carefully selected series, such as building construction,
motorcar output and registration, exchange rates, and industrial
employment. The importance attributed to the various series
differs from time to time. Also, new ones are added and old ones
discarded as the economic situation changes. In all cases where
warranted, the Brookmire Economic Service distinguishes
carefully between basic trends, seasonal variation, and business
cycles in its appraisal of the business outlook. The Standard &
Poor's Corporation also watches many lines of activity, and
forecasts development in each line. The forecast of general
business is mainly a summary of these many individual forecasts.
Moody's Investors Service likewise bases its general forecast
upon many individual analyses. In making its forecasts,
however, Moody's appears to be especially influenced by
businessmen's anticipations of profits, a factor that receives much
emphasis in modern business-cycle theory.

Seasonal variations in individual series are eliminated.

Forecasting Particular Lines of Activity. The same methods 
are used to forecast particular lines of activity as for general 
business conditions. Crude historical analogy, the use of
leading series, and crosscut analysis all play their roles.

Crude Historical Analogy. Figure 155 is an excellent example
of the use of crude historical analogy in forecasting the course of 
agricultural prices and of wage rates during a long and extensive 
world war. Here the course of agricultural prices and wages in 
the First World War is taken as the pattern for the expected 
course of agricultural prices and wages in the Second World War. 
From the proximity of the two series to each other until the 
beginning of 1943 it would seem that the forecasting power of the 
former series is relatively high. This method is of greater value 
in forecasting particular lines of activity than it is when applied 
to general business conditions, although crosscut analysis might 
modify judgment of this forecast by pointing out that the efforts 
at price control and inflation control in the Second World War 
appear to be more courageous than they were in the First
World War.

Fig. 155. — Prices received by farmers and composite wage rates.
Index numbers, United States, 1914-1920 and 1939-1943. [From The
Agricultural Situation (May, 1943), p. 8, published by the Bureau
of Agricultural Economics, United States Department of
Agriculture.] Chart notes: wages of farm labor and building trades,
and salaries of teachers and clerical employees; adjusted for
seasonal variation.

Lead-Lag Relationships. Figure 156 illustrates the lead-lag
relationship in forecasting hog production. In raising hogs the
principal cost is the corn on which the hogs are fed.
Furthermore, the ratio between the amount of corn fed to a hog and
its weight is fairly constant. Hence, the profitability of hog
raising is essentially indicated by the so-called "corn-hog
differential," which is the difference between the price of 100
pounds of hogs and the cost of enough corn to raise 100 pounds of
hogs. As this differential increases, hog production becomes more
profitable; as it decreases, hog production becomes less profitable.
The increase or decrease in profitability affects hog production
with a lag of several months. Hence, changes in the corn-hog
differential can be used to forecast changes in hog production, as
shown in the figure.
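
The lead-lag reasoning for hogs and corn can be sketched
numerically. The example below is a minimal illustration under
stated assumptions, not the method of the Bureau of Agricultural
Economics: the prices, the feed requirement of 10 bushels of corn
per 100 pounds of hog, the production figures (constructed here to
follow the differential by exactly three periods), and all function
names are hypothetical.

```python
from math import sqrt


def corn_hog_differential(hog_price, corn_price, bushels_per_cwt):
    """Price of 100 lb of hogs minus the cost of the corn needed to raise them."""
    return [h - bushels_per_cwt * c for h, c in zip(hog_price, corn_price)]


def pearson(x, y):
    """Ordinary coefficient of correlation between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag]."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


def best_lag(leader, follower, max_lag):
    """Lag (in periods) at which the follower tracks the leader most closely."""
    return max(range(max_lag + 1),
               key=lambda lag: lagged_correlation(leader, follower, lag))


# Hypothetical prices: dollars per 100 lb of hogs and dollars per bushel of corn.
hog_price = [9.0, 9.5, 10.0, 11.0, 12.0, 11.5, 10.5, 9.5, 9.0, 10.0, 11.0, 10.5]
corn_price = [0.60, 0.60, 0.65, 0.65, 0.70, 0.75, 0.75, 0.70, 0.65, 0.60, 0.60, 0.65]
differential = corn_hog_differential(hog_price, corn_price, bushels_per_cwt=10)

# Hypothetical production, built to respond to the differential three periods later.
production = [60.0, 62.0, 61.0] + [50 + 2 * d for d in differential[:-3]]

print(best_lag(differential, production, max_lag=5))
```

With real data the correlation at the best lag would fall short of
unity, and the lag itself would be expected to drift from cycle to
cycle, which is the difficulty of timing noted earlier in the
chapter.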


Fig. 156. — Hog-corn price ratio and hog marketings, 1901-1942.
(From Bureau of Agricultural Economics, United States Department
of Agriculture.) Chart notes: average prices of packer and shipper
purchases and yellow corn; a moving average of federally inspected
hog slaughter.

The Cycle Hypothesis. The lag of hog production behind the
corn-hog differential not only permits forecasting of the former
but also tends to cause periodic upswings and downswings in the
two series. The reason for this is as follows: Suppose that the
demand for pork increases and, owing to the inability to increase
rapidly the production of hogs, the corn-hog differential rises.
This makes hogs more profitable to produce, and their number is
gradually increased. The lag in response, however, may cause
the differential to go higher than it would otherwise, and this in
turn might stimulate a greater increase in production than is
required to meet the new demand that caused the original rise
in the ratio. When this enlarged production comes on the
market, prices fall and the corn-hog differential drops. Owing to
the greatly increased supply, prices go lower than their natural
level and hog production becomes less profitable for a while.
The change in profitability causes hog production to drop off,
and ultimately prices tend to rise again, completing the cycle.
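
The mechanism just described can be imitated with a very simple
model. The sketch below is an assumption-laden illustration and not
part of the original analysis: it uses a linear demand curve, a
supply that responds to the price of the previous period (a
simplification often called the cobweb model), and hypothetical
coefficients chosen so that the swings die away.

```python
def equilibrium_price(a, b, c, d):
    """Price at which lagged supply a + b*p equals demand c - d*p."""
    return (c - a) / (b + d)


def simulate_prices(start_price, periods, a=20.0, b=4.0, c=100.0, d=5.0):
    """Each period producers supply a + b * (last period's price); the
    market then clears at the price where demand c - d*p equals that supply."""
    prices = [start_price]
    for _ in range(periods):
        supply = a + b * prices[-1]
        prices.append((c - supply) / d)
    return prices


prices = simulate_prices(start_price=12.0, periods=8)
normal = equilibrium_price(20.0, 4.0, 100.0, 5.0)

for period, price in enumerate(prices):
    position = "above" if price > normal else "below"
    print(period, round(price, 2), position, "normal")
```

With these coefficients each swing is 0.8 of the one before it (the
ratio b/d), so prices alternate above and below the normal level
with shrinking amplitude. If supply responded more strongly than
demand (b greater than d), the swings would grow instead of fading.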

The existence of a periodic movement in the corn-hog
differential and in hog production permits forecasting for some
distance into the future. If a great war does not interrupt the
normal course of economic forces, the high and low periods in
the corn-hog differential can be predicted with a fair degree of
accuracy. Wise hog farmers gain considerably from this
long-range forecasting. Similar periodic movements tend to appear
in other lines of agriculture in which production lags behind
price stimuli. For example, the cattle cycle runs about fifteen
years, according to studies made by the United States
Department of Agriculture.

Crosscut Analysis. The application of crosscut analysis to
particular lines of activity is based in many instances on the
analysis of supply and demand conditions. In agriculture, the
carry-over and current crop prospects are important factors
on the supply side. The economic condition of industries or of
sections of the population using the given product, the prices of 
competing products, and the output of competing areas are 
important factors on the demand side. If the product has 
widespread uses, possibly prediction of changes in consumer 
incomes or in general industrial activity might be the best way of 
forecasting the future demand for it. 

In manufacturing, principal attention is likely to be devoted 
to demand. When the demand is industrial, the forecasting 
takes primarily the form of predicting conditions in those lines 
of activity immediately supplied by the given line of manufac- 
turing. Thus steel production might be forecasted from railroad 
construction and maintenance, automobile production, road 
construction, and building activity. When the product is one 
sold to the consuming public and not to other industries, the 
analysis of demand becomes largely a study of the flow of income 
to consuming areas. This will be dependent on the prosperity 
of important industries in these areas and on the net flow of 
incomes from outside sources. The prices of competing products
will also be an important demand factor.

A statistical technique using multiple and partial correlation,
by mathematical and graphic methods, has been developed for
making crosscut analyses such as those suggested in the two
preceding paragraphs. This technique is widely used; in the case
of many products the multiple- and partial-correlation technique
makes it possible to derive demand curves that will forecast with
considerable accuracy the amount of change in sales that would
accompany a given contemplated change in price.

Cf. Schultz, Henry, The Theory and Measurement of Demand (1938).

FORECASTS WITH SEASONAL VARIATION

Forecasting with seasonal variation is probably the oldest of
all types of modern forecasting and is so general as to be
commonplace. It is applied to particular lines of activity more
specifically than to general business conditions.

Historical Analogy. Use of historical analogy for forecasting
with seasonal variation is simpler and more dependable than the
use of historical analogy for cyclical or trend forecasts. The
conditions underlying persistent seasonal variations are more
readily analyzed than are the rational explanations of cycles and
trends. Moreover, the forecasting is for a shorter period into
the future and can therefore depend upon conditions remaining
approximately unchanged pending the outcome of the forecasted
events. Statistical techniques have been developed for
measuring the dependability of a given seasonal variation.

        JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP  OCT  NOV  DEC
1942    7.9  8.0  8.3  7.7  7.1  6.7  6.8  6.9  6.5  7.2  7.2  7.5
1943    8.5  6.5  8.5  8.5  8.3  7.4  7.4  7.1  7.1

Fig. 157. — Mortality from all causes. Metropolitan Life Insurance
Company industrial department, weekly premium-paying business.
[From the Statistical Bulletin, Vol. 24 (July, 1943), p. 12,
published by the Metropolitan Life Insurance Company.]

See pp. 631-636.

Figure 157 illustrates the extrapolation of seasonal variation,
which is the use of historical analogy for making a forecast with
seasonal variation. From the figure it can be forecast, by
assuming a continued agreement between the 1942 and the coming
1943 seasonal movement in mortality from all causes, that the
September death rate per 1,000, annual basis, will be about 7.25,
the October and November rate about 8.25, and the December
rate about 8.50.
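
The extrapolation of a seasonal pattern can be written out as a
short calculation. The sketch below is one possible reading of the
method, offered with hypothetical figures rather than the values
plotted in Fig. 157: it assumes that the remaining months of the
current year will stand in the same ratio to last year's months as
the months already observed, and the function name is illustrative
only.

```python
def extrapolate_seasonal(previous_year, current_partial):
    """Forecast the remaining months of the current year by scaling the
    corresponding months of the previous year by the ratio of the two
    years' totals over the months observed so far."""
    months_seen = len(current_partial)
    level_ratio = sum(current_partial) / sum(previous_year[:months_seen])
    return [round(value * level_ratio, 2) for value in previous_year[months_seen:]]


# Hypothetical monthly rates: a full previous year and eight months
# of the current year.
previous_year = [8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 7.0, 7.0, 8.0]
current_partial = [8.8, 8.8, 8.8, 7.7, 7.7, 7.7, 6.6, 6.6]

# Forecasts for September through December of the current year.
forecast = extrapolate_seasonal(previous_year, current_partial)
print(forecast)
```

A ratio is used here so that a higher general level in the current
year raises every forecast month proportionally; adding a constant
difference instead would be an equally simple alternative.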

Figure 158 applies the forecasting of seasonal variation by
historical analogy to the field of agricultural economics.
Extrapolation of the 1943 curve predicts that income
from farm marketings in the South Central region of the United
States will fluctuate around 200 million dollars monthly until
July or August; thereafter, monthly cash income from farm
marketings in that region will rise sharply to a peak in October
of perhaps 500 million dollars or higher, since the 1943 level
appears to be a higher average than that of 1942. This figure
also shows the annual average seasonal movement, 1937-1941, which
gives a somewhat more dependable seasonal indicator than a
single year's figures.

Fig. 158. — Cash income from farm marketings, 1942-1943, compared
with 1937-1941 average. [From The Agricultural Situation, Vol. 27
(June, 1943), p. 8, published by the Bureau of Agricultural
Economics, United States Department of Agriculture.]

Combining Seasonal with Cyclical Forecasting. Whenever it
is desired to make forecasts on the basis of a period shorter than
a year, it is necessary to apply a seasonal forecast along with
cyclical forecasting. In the case of conventional forecasting
by the use of business-cycle studies and the resulting barometers,
general business indexes, and crosscut analysis, discussed in a
preceding section of this chapter, short-period forecasts based
upon known seasonal variations are used as well as the cyclical
forecasts.

Fig. 159. — Transportation loads for livestock, estimated on basis
of indicated marketings and shipments from public markets, United
States, January, 1941-March, 1944. [From The Agricultural
Situation, Vol. 27 (February, 1943), p. 8, published by the Bureau
of Agricultural Economics, United States Department of
Agriculture.]

Many illustrations could be found of the application of this
combination of seasonal with cyclical forecasting. Figure 159
is an illustration in the field of agricultural economics. Based
upon statistical forecasting of the cycles in production of
livestock, similar to the cycle analysis of hog production already
outlined, the levels of livestock marketings for 1943 and 1944
are forecast. The annual amount is then distributed throughout
the months of the year according to the predetermined index of
seasonal variation. The figure presents the resulting forecast
of monthly transportation loads for livestock, estimated from
indicated marketings and shipments from public markets in the
United States. On the same figure are shown the actual amounts
monthly for the years 1941 and 1942, for purposes of comparison.
Figures similar to this one for various lines of industrial and
manufacturing activity appear frequently in such publications
as the Survey of Current Business and in the publications of the
various forecasting agencies.
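
The step of distributing an annual forecast over the months can be
shown in a few lines. This is a sketch under stated assumptions,
not the procedure behind Fig. 159: the annual total and the
seasonal index are hypothetical, the index is taken to be in per
cent with an average month equal to 100, and the function name is
illustrative.

```python
def distribute_annual(annual_total, seasonal_index):
    """Spread an annual forecast over twelve months in proportion to a
    seasonal index (average month = 100)."""
    if len(seasonal_index) != 12:
        raise ValueError("a monthly seasonal index needs twelve values")
    index_sum = sum(seasonal_index)
    return [annual_total * value / index_sum for value in seasonal_index]


# Hypothetical seasonal index, January through December.
seasonal_index = [90, 85, 95, 100, 105, 110, 100, 95, 105, 120, 110, 85]

# Hypothetical annual forecast obtained from the cyclical analysis.
annual_forecast = 1200.0

monthly = distribute_annual(annual_forecast, seasonal_index)
for month_number, amount in enumerate(monthly, start=1):
    print(month_number, round(amount, 1))
```

Dividing by the sum of the index, rather than by 1,200, keeps the
monthly figures adding up to the annual forecast even when the
index has not been adjusted to average exactly 100.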
THE QUALITY OF FORECASTING

The success of forecasting is hard to judge. First it is to be
noted that if an agency declines to make forecasts in difficult
situations and makes rather limited forecasts in general, it is
likely to have fewer failures than one that boldly undertakes to
forecast on all occasions and in considerable detail. The success
of a forecasting agency should be judged according to what it
attempts to do.

The success of forecasting should also be judged in the light
of what might be accomplished by mere random guessing. In
other words, an agency should be right at least 50 per cent of the
time, or it is worse than useless. Judged on these bases, the
various economic forecasting agencies have been fairly successful
in forecasting general business conditions. Although not
registering anything near a perfect score, they have at least
been better than chance.
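
The comparison with random guessing can be made concrete. The
sketch below is illustrative only: it assumes each forecast is a
simple two-way call (business up or down) so that a guess is right
half the time, it uses a hypothetical record of 14 correct
forecasts out of 20, and the function names are not from the text.

```python
from math import comb


def hit_rate(outcomes):
    """Proportion of forecasts that proved correct."""
    return sum(outcomes) / len(outcomes)


def chance_of_at_least(hits, trials, p=0.5):
    """Probability that pure guessing, right with probability p each time,
    would score at least `hits` correct out of `trials` (binomial tail)."""
    return sum(comb(trials, k) * p ** k * (1 - p) ** (trials - k)
               for k in range(hits, trials + 1))


# Hypothetical record: True marks a forecast that proved correct.
record = [True] * 14 + [False] * 6

rate = hit_rate(record)
luck = chance_of_at_least(sum(record), len(record))

print(round(rate, 2), round(luck, 4))
```

A hit rate above 50 per cent is more convincing the longer the
record: the smaller the probability that guessing alone would do as
well, the stronger the case that the agency has been better than
chance.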

One of the chief problems of economic forecasting lies in
the effect of the forecast itself. The effect of the forecast may
conceivably be such, on the one hand, that the forecast actually
causes the forecasted event to occur, or, on the other hand, that
the forecast actually prevents the forecasted event from
occurring. Whether or not such untoward results are produced
depends largely on how widely the forecast circulates. On the
one hand, suppose a forecasting agency predicts a general
inflation of prices and enough people become convinced that the
forecast is a true one; in such a case, the forecast may not only
become true but be itself the cause of the thing that is forecasted.
On the other hand, a subscriber to a forecast expects to profit from
its use, in that his plans will anticipate probable future conditions
of which a competitor is supposedly not so well informed. The
fewer who have this information, the more likely it is that they
will profit and that the forecast will be a true one. But the
wider the acceptance of the forecast, the less chance the
individual subscriber has to gain and the less likely is it that the
forecast will prove to be true. Suppose, for example, that a
forecasting agency advises its clients in a given productive
activity that the price of its product is going to rise as a result of
some foreseen increase in demand; if too many of the producers
obtain the forecaster's service and follow its advice,
overproduction will result and the price will decline rather than
rise. This is an illustration of how a forecast might defeat itself.

In the final analysis, it may be said that the greatest value of
modern forecasting work lies in the large amount of statistical
economic analysis that it promotes. Research into the business
cycle and continued improvements in the statistical approach to
social and economic problems cannot fail to reveal closer and
closer approximations to the truth and thereby improve general
knowledge about economic and social conditions.

Table I. — Four-place Common Logarithms of Numbers
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0792 

0828 

0864 

0899 
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1399 
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1461 
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1673 
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2041 
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8 
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1 

3 

4 

6 

7 
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1 

3 

4 

5 

7 
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1 

3 

4 

5 

6 
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1 

3 

4 

5 

0 

3.5 
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1 

2 

4 

5 

6 

3.6 
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1 

2 

4 

6 

6 
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3 
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1 

2 

3 

4 
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6170 
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1 

2 

3 

4 

5 
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1 

2 

3 

4 

5 
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1 

2 

3 

4 

6 
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6474 
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1 

2 

3 

4 

5 
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1 

2 

3 

4 

6 

4.6 
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6637 
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1 

2 

3 

4 

5 

4.7 
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1 

2 

3 

4 

5 

4.8 
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6848 
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1 
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4 

4 

4.9 
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6937 
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1 

2 

3 

4 

4 

• 
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1 
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3 

4 
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1 

2 

3 

3 

4 
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1 

2 

2 

3 

4 

5.3 
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1 
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3 

4 

5.4 
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e 
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7356 
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1 

2 

2 

3 

4 


Taken, with permission, from E. V. Huntington's Four Place Tables
of Logarithms and Trigonometric Functions (Harvard Cooperative
Society, Inc., 1907).

Table I. — Four-place Common Logarithms of Numbers. — (Continued)
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2761 

2763 

2765 

2758 

2760 

2762 

2766 

1.89 

2765 

2767 

2769 

2772 

2774 

2776 

2778 

2781 

2783 

2786 

2788 

1.90 

0T2788 

2790 

2792 

2794 

2797 

2799 

2801 

2804 

2806 

2808 

2810 

l.«l 

2810 

2813 

2816 

2817 

2819 

2822 

2824 

2826 

2828 

2831 

2833 

1.92 

2833 

2835 

2838 

2840 

2842 

2844 

2847 

2849 

2851 

2853 

2866 

1.93 

2866 

2868 

2860 

2862 

2866 

2867 

2869 

2871 

2874 

2876 

2878 

1.94 

2878 

2880 

2882 

2886 

2887 

2889 

2891 

2894 

2896 

2898 

2900 

1.96 

2900 

2903 



2909 

2911 

2914 

2916 

291 S 

2920 

2923 

1.96 

2923 

2926 

2927 

2929 

2931 

2934 

2936 

2938 

2040 

2942 

2946 

1.9% 

.2946^ 

2947 

2949 

2961 

2953 

2956 

2958 

2960 

2962 

"2964 

2967 

1.98 

• 2967 

2969 

2971 

2973 

2976 

2978 

2980 

2982 

2984 

2986 

2989 

1^ 

2989 

2991 

2993 

2996 

2997 

2999 

3002 

3004 

3006 


8010 

mammmmmm 0 m 
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Table II. — Squares op Numbers' 


N 

0 

• 1 

2 

2 

4 

5 

6 

7 

0 

9 

100 

10000 

10201 

10404 


10816 

11026 

11236 

11449 

11664 

11881 

110 

12100 

12321 

12544 

12769 

12996 

13225 

13466 

13689 

13924 

14161 

120 

14400 

14641 

14884 

15129 

15376 

15625 

15876 

16129 

16384 

16641 

130 

» 16900 

17161 

17424 

17689 

17956 

18225 

18496 

18769 

19044 

19321 

140 

19600 

19881 

20164 

20449 


21025 

21316 

21609 

21904 

22201 

150 

22500 

22801 

23104 

23409 

23716 


24336 

24649 

24964 

25281 

160 

25600 

25921 

26244 

26569 

26896 

27226 

27556 

27889 

28224 

28561 

170 

28900 

29241 

29584 

. 29929 

30276 

30625 

30976 

31329 

31684 

32041 

180 

32400 

32761 

33124 

33489 

33856 

34226 

34596 

34969 

35344 

35721 

190 

36100 

36481 

36864 

37249 

37636 

38025 

38416 

38809 

39204 

39601 

200 

40000 

40401 

40804 

41209 

41616 

42026 

42436 

42849 

43264 

43681 

210 

44100 

44521 

44944 

45369 

45796 

46225 

46656 

47089 

47524 

47961 

220 

48400 

48841 

49284 

49729 

60176 

60625 

51076 

51529 

51984 

52441 

• 230 

52900 

53361 

53824 

54289 

54756 

55226 

55696 

56169 

56644 

57121 

240 

57600 

58081 

58564 


• 59536 


60616 

61009 

61504 

62001 

250 

62500 

63001 

63504 


64516 

65025 

65536 

66049 

66564 

67081 

260 

67600 

68121 

68644 

69169 

69696 

70225 

70756 

71289 

71824 

72361 

270 

72900 

73441 

73984 

74529 

75076 

75625 

76176 

76729 

77284 

77841 

280 

78400 

78961 

79524 

80089 

80656 

81225 

81796 

82369 

82944 

83521 

290 

84100 

84681 

85262 

85849 

86436 

87025 

87616 

88209 

88804 

89401 

300 

90000 

90601 

91204 

91809 

92416 

93025 

93636 

94249 

94864 

95481 

310 

96100 

96721 

97344 

97969 

98596 

99225 

99856 

100489 

101124 

101761 

320 

102400 

103041 

103684 

104329 

104976 

105625 

106276 

106929 

107584 

108241 

330 

108900 

109561 

110224 

110889 

111556 

112225 

112896 

113569 

114244 

114921 

340 

115600 

116281 

116964 

117649 

118336 

119025 

119716 

120409 

121104 

121801 

3.50 

122500 

123201 

123904 

124609 

125316 

126025 

126736 

' 127449 

128164 

128881 

360 

129600 

130321 

131044 

131769 

132496 

133225 

133956 

134689 

135424 

136161 

370 

136900 

137641 

138384 

139129 

139876 

140625 

141376 

142129 

142884 

143641 

380 

144400 

f45161 

145924 

146689 

147456 

148225 

148996 

149769 

150544 

151321 

390 

152100 

152881 

153664 

164449 

156236 

156025 

156816 

157609 

158404 

159201 

400 

160000 

1 

160801 


162409 

163216 

164025 

164836 

165649 

166464 

167281 

410 

168100 

168921 

169744 

170569 

171396 

172225 

173056 

173889 

174724 

175561 

420 

176400 

477241 

178084 

178929 

179776 

180625

181476 

182329 

183184 

184041 

430 

184900 

185761 

186624 

187489 

188356 

189225 

190096 

190969 

191844 

192721 

440 

193600 

194481 

195364 

196249 

197136 

198025

198916 

199809 

200704 

201601

450

202500

203401 

204304 

205209

206116 

207025 

207936 

208849 

209764 

210681 

460 

211600 

212521 

213444 

214369 

215296

216225 

217156 

218089 

219024

219961 

470 

220900 

221841 

222784 

223729 

224676

225625

226576

227529 

228484 

229441

480

230400

231361 

232324 

233289 

234256

235225 

236196 

237169 

238144 

239121 

490 

240100 

241081 

242064 

243049 

244036 

245025 

246016

247009 

248004 

249001 

500 

250000 

251001

252004 

253009 

254016 

255025 

256036 

257049 

258064 

259081 

510

260100 

261121 

262144

263169

264196

265225 

266256 

267289 

268324 

269361 

520 

270400 

271441

272484 

273529 

274576 

275625

276676 

277729 

278784 

279841 

530

280900

281961

283024

284089

285156 

286225 

287296 

288369

289444

290521

540

291600

292681

293764

294849

295936

297025

298116

299209

300304

301401


¹ Source: Waugh, Albert E., Laboratory Manual and Problems for Elements of Statistical
Method (McGraw-Hill Book Company, Inc., 1944).
Table II. — Squares of Numbers. — (Continued)
Table III. — Square Roots of Numbers from 10 to 100¹


N 

0.0 

0.1

0.2 

0.3 

0.4 

0.5 

0.6 

0.7

0.8 

0.9 

10 

3.162 

3.178 

3.194 

3.209 

3.225

3.240

3.256

3.271 

3.286 

3.302 

11 

3.317 

3.332 

3.347 

3.362 

3.376 

3.391 

3.406 

3.421 

3.435 

3.450 

12 

3.464 

3.479 

3.493

3.507

3.521

3.536

3.550

3.564

3.578

3.592 

13 

3.606

3.619

3.633

3.647

3.661

3.674

3.688

3.701

3.715 

3.728 

14 

3.742 

3.755

3.768 

3.782 

3.795 

3.808 

3.821 

3.834 

3.847 

3.860 

15 

3.873 

3.886 

3.899

3.912

3.924

3.937 

3.950 

3.962 

3.975 

3.987 

16 

4.000 

4.012 

4.025 

4.037 

4.050 

4.062 

4.074 

4.087 

4.099

4.111 

17 

4.123 

4.135 

4.147 

4.159 

4.171 

4.183 

4.195 

4.207 

4.219 

4.231 

18 

4.243 

4.254 

4.266 

4.278 

4.290 

4.301 

4.313 

4.324 

4.336 

4.347 

19 

4.359 

4.370 

4.382 

4.393

4.405

4.416

4.427 

4.438 

4.450 

4.461 

20 

4.472 

4.483 

4.494 

4.506 

4.517

4.528 

4.539 

4.550 

4.561 

4.572 

21 

4.583 

4.593 

4.604 

4.615 

4.626 

4.637 

4.648 

4.658 

4.669 

4.680 

22 

4.690 

4.701 

4.712 

4.722 

4.733 

4.743 

4.754

4.764 

4.775 

4.785 

23

4.796 

4.806 

4.817 

4.827 

4.837 

4.848 

4.858 

4.868 

4.879 

4.889 

24 

4.899 

4.909 

4.919 

4.930 

4.940

4.950

4.960 

4.970 

4.980 

4.990 

25

5.000

5.010 

5.020 

5.030 

5.040 

5.050 

5.060 

5.070 

5.079

5.089 

26 

5.099 

5.109 

5.119 

5.128

5.138 

5.148 

5.158 

5.167

5.177

5.187 

27 

5.196

5.206

5.215

5.225

5.234

5.244

5.254

5.263

5.273

5.282 

28 

5.292

5.301 

5.310 

5.320 

5.329 

5.339 

5.348 

5.357 

5.367 

5.376 

29 

5.385 

5.394 

5.404 

5.413 

5.422 

5.431 

5.441 

5.450 

5.459 

5.468 

30 

5.477

5.486

5.495

5.505

5.514 

5.523 

5.532 

5.541 

5.550 

5.559 

31 

5.568

5.577

5.586

5.595

5.604

5.612

5.621

5.630

5.639 

5.648 

32 

5.657

5.666

5.674

5.683

5.692

5.701

5.710

5.718 

5.727 

5.736 

33 

5.745

5.753

5.762

5.771

5.779

5.788

5.797

5.805 

5.814 

5.822 

34 

5.831 

5.840 

5.848 

5.857

5.865

5.874

5.882 

5.891 

5.899 

5.908 

35

5.916 

5.925 

5.933 

5.941 

5.950 

5.958 

5.967 

5.975 

5.983 

5.992 

36 

6.000 

6.008 

6.017 

6.025 

6.033 

6.042 

6.050

6.058 

6.066 

6.075 

37

6.083 

6.091 

6.099 

6.107 

6.116 

6.124 

6.132 

6.140 

6.148 

6.156 

38 

6.164 

6.173 

6.181 

6.189 

6.197 

6.205 

6.213 

6.221 

6.229 

6.237 

39 

6.245 

6.253 

6.261 

6.269 

6.277 

6.285 

6.293 

6.301 

6.309 

6.317 

40 

6.325 

6.332 

6.340

6.348

6.356

6.364 

6.372 

6.380 

6.387 

6.395 

41 

6.403 

6.411 

6.419 

6.427 

6.434 

6.442 

6.450

6.458

6.465 

6.473 

42 

6.481 

6.488

6.496 

6.504 

6.512

6.519

6.527

6.535 

6.542 

6.550 

43 

6.557

6.565

6.573

6.580

6.588

6.595 

6.603 

6.611 

6.618 

6.626 

44 

6.633 

6.641 

6.648 

6.656 

6.663 

6.671

6.678 

6.686 

6.693 

6.701 

45 

6.708 

6.716 

6.723 

6.731 

6.738 

6.746 

6.753

6.760 

6.768 

6.775 

46 

6.782 

6.790 

6.797 

6.804 

6.812 

6.819 

6.826 

6.834 

6.841

6.848 

47 

6.856 

6.863 

6.870 

6.878 

6.885 

6.892 

6.899 

6.907 

6.914 

6.921 

48 

6.928 

6.935 

6.943 

6.950

6.957 

6.964 

6.971 

6.979 

6.986 

6.993

49 

7.000 

7.007 

7.014 

7.021 

7.029 

7.036 

7.043 

7.050 

7.057 

7.064 

50 

7.071 

7.078

7.085

7.092 

7.099 

7.106 

7.113

7.120 

7.127 

7.134 

51 

7.141 

7.148 

7.155

7.162 

7.169 

7.176 

7.183 

7.190 

7.197 

7.204 

52 

7.211 

7.218

7.225

7.232 

7.239 

7.246 

7.253

7.259 

7.266 

7.273 

53 

7.280

7.287 

7.294 

7.301 

7.308 

7.314

7.321

7.328

7.335

7.342

54 

7.348

7.355

7.362

7.369 

7.376 

7.382 

7.389

7.396

7.403 

7.409 


¹ Source: Waugh, Albert E., Laboratory Manual and Problems for Elements of Statistical
Method (McGraw-Hill Book Company, Inc., 1944).
Table III. — Square Roots of Numbers from 10 to 100. — (Continued)


N 

0.0 

0.1 

0.2

0.3

0.4

0.5

0.6

0.7

0.8 

0.9 

55 

7.416 

7.423 

7.430 

7.436 

7.443 

7.450 

7.457 

7.463 

7.470 

7.477 

56 

7.483 

7.490 

7.497 

7.503

7.510

7.517 

7.523 

7.530 

7.537 

7.543 

57 

7.550

7.556

7.563

7.570

7.576

7.583

7.589 

7.596 

7.603 

7.609 

58 

7.616 

7.622 

7.629 

7.635 

7.642 

7.649 

7.655 

7.662 

7.668 

7.675 

59 

7.681 

7.688 

7.694 

7.701 

7.707 

7.714 

7.720 

7.727 

7.733 

7.740 

60 

7.746 

7.752 

7.759

7.765

7.772

7.778

7.785

7.791 

7.797 

7.804 


61

7.810

7.817 

7.823 

7.829 

7.836 

7.842 

7.849 

7.855

7.861 

7.868 


62

7.874

7.880 

7.887 

7.893 

7.899 

7.906 

7.912 

7.918 

7.925

7.931

63

7.937

7.944

7.950 

7.956 

7.962 

7.969 

7.975 

7.981 

7.987 

7.994 


64

8.000

8.006 

8.012 

8.019 

8.025 

8.031 

8.037 

8.044 

8.050 

8.056 

65 

8.062 

8.068 

8.075 

8.081 

8.087 

8.093 

8.099 

8.106 

8.112 

8.118 

66 

8.124 

8.130 

8.136 

8.142 

8.149 

8.155 

8.161 

8.167 

8.173 

8.179 

67 

8.185

8.191 

8.198 

8.204 

8.210 

8.216 

8.222 

8.228 

8.234 

8.240 

68 

8.246 

8.252 

8.258 

8.264 

8.270 

8.276

8.283 

8.289 

8.295 

8.301 

69 

8.307 

8.313 

8.319 

8.325

8.331 

8.337 

8.343 

8.349 

8.355 

8.361

70 

8.367 

8.373 

8.379 

8.385 

8.390 

8.396 

8.402 

8.408 

8.414

8.420 

71 

8.426 

8.432 

8.438 

8.444 

8.450 

8.456 

8.462 

8.468 

8.473 

8.479 

72 

8.485 

8.491 

8.497 

8.503 

8.509 

8.515 

8.521 

8.526 

8.532 

8.538 

73 

8.544 

8.550 

8.556 

8.562 

8.567 

8.573 

8.579 

8.585 

8.591 

8.597 

74 

8.602 

8.608 

8.614 

8.620 

8.626 

8.631 

8.637 

8.643 

8.649 

8.654 

75 

8.660 

8.666 

8.672 

8.678 

8.683 

8.689 

8.695 

8.701 

8.706 

8.712 

76 

8.718 

8.724 

8.730 

8.735 

8.741 

8.746 

8.752 

8.758 

8.764 

8.769 

77 

8.775 

8.781 

8.786 

8.792 

8.798 

8.803 

8.809 

8.815 

8.820 

8.826 

78 

8.832 

8.837 

8.843 

8.849 

8.854 

8.860 

8.866 

8.871 

8.877 

8.883 

79 

8.888 

8.894 

8.899 

8.905 

8.911 

8.916 

8.922 

8.927 

8.933 

8.939 

80

8.944 

8.950 

8.955 

8.961 

8.967 

8.972 

8.978 

8.983 

8.989 

8.994 

81 

9.000 

9.006 

9.011 

9.017 

9.022 

9.028 

9.033 

9.039 

9.044 

9.050 

82 

9.055 

9.061

9.066 

9.072 

9.077 

9.083 

9.088 

9.094 

9.099 

9.105

83 

9.110 

9.116 

9.121 

9.127 

9.132 

9.138 

9.143 

9.149 

9.154

9.160 

84 

9.165 

9.171

9.176 

9.182 

9.187 

9.192 

9.198 

9.203 

9.209 

9.214 

85 

9.220 

9.225 

9.230 

9.236 

9.241 

9.247 

9.252 

9.257 

9.263 

9.268 

86 

9.274 

9.279 

9.284 

9.290 

9.295

9.301 

9.306 

9.311 

9.317 

9.322 

87 

9.327 

9.333 

9.338 

9.343 

9.349 

9.354 

9.359 

9.365 

9.370

9.376 

88 

9.381 

9.386 

9.391 

9.397 

9.402 

9.407 

9.413 

9.418 

9.423 

9.429 

89 

9.434 

9.439 

9.445 

9.450 

9.455 

9.460 

9.466 

9.471 

9.476

9.482 

90 

9.487 

9.492 

9.497 

9.503 

9.508 

9.513 

9.518 

9.524 

9.529 

9.534 

91 

9.539

9.545 

9.550 

9.555 

9.560 

9.566 

9.571 

9.576 

9.581 

9.586 

92 

9.592 

9.597 

9.602 

9.607 

9.612 

9.618 

9.623 

9.628 

9.633 

9.638 

93 

9.644 

9.649 

9.654 

9.659 

9.664 

9.670 

9.675 

9.680 

9.685 

9.690 

94 

9.695 

9.701 

9.706 

9.711 

9.716 

9.721 

9.726 

9.731 

9.737

9.742

95

9.747 

9.752 

9.757

9.762 

9.767 

9.772 

9.778 

9.783 

9.788

9.793 

96 

9.798 

9.803 

9.808 

9.813 

9.818 

9.823 

9.829 

9.834 

9.839 

9.844 

97 

9.849

9.854 

9.859 

9.864 

9.869 

9.874 

9.879 

9.884 

9.889

9.894 

98 

9.899

9.905 

9.910 

9.915 

9.920 

9.925 

9.930 

9.935 

9.940 

9.945

99

9.950 

9.955 

9.960 

9.965 

9.970

9.975

9.980 

9.985 

9.990 

9.995 















Table IV. — Square Roots of Numbers from 100 to 1000¹


N 

0 

1

2 

3 

4

5

6 

7 

8 

9 

100 

10.00 

10.05

10.10

10.15

10.20

10.25

10.30

10.34

10.39

10.44

110

10.49 

10.54

10.58

10.63

10.68

10.72

10.77

10.82

10.86 

10.91 

120 

10.95

11.00

11.05

11.09

11.14

11.18

11.22

11.27 

11.31 

11.36 

130 

11.40 

11.45

11.49

11.53

11.58

11.62

11.66

11.70 

11.75 

11.79 

140 

11.83 

11.87 

11.92 

11.96 

12.00 

12.04 

12.08 

12.12 

12.17 

12.21 

150

12.25

12.29 

12.33 

12.37 

12.41 

12.45 

12.49 

12.53

12.57

12.61 

160 

12.65

12.69 

12.73 

12.77 

12.81 

12.85

12.88 

12.92 

12.96 

13.00 

170 

13.04 

13.08 

13.11 

13.15

13.19 

13.23 

13.27 

13.30 

13.34 

13.38 

180 

13.42 

13.45 

13.49 

13.53

13.56

13.60 

13.64 

13.67 

13.71 

13.75 

190 

13.78 

13.82 

13.86 

13.89 

13.93 

13.96 

14.00 

14.04

14.07

14.11

200 

14.14 

14.18 

14.21 

14.25

14.28

14.32

14.35

14.39 

14.42 

14.46 

210 

14.49 

14.53

14.56

14.59

14.63 

14.66 

14.70 

14.73 

14.76 

14.80 

220 

14.83 

14.87 

14.90 

14.93 

14.97 

15.00

15.03 

15.07 

15.10 

15.13 

230 

15.17

15.20

15.23

15.26

15.30

15.33

15.36 

15.39 

15.43 

15.46

240

15.49

15.52

15.56

15.59

15.62

15.65

15.68

15.72

15.75

15.78

250

15.81

15.84

15.87

15.91

15.94

15.97

16.00

16.03

16.06

16.09

260 

16.12 

16.16 

16.19 

16.22 

16.25

16.28 

16.31 

16.34 

16.37 

16.40 

270 

16.43 

16.46 

16.49 

16.52

16.55

16.58

16.61

16.64

16.67

16.70

280 

16.73 

16.76 

16.79 

16.82 

16.85 

16.88 

16.91 

16.94 

16.97 

17.00 

290 

17.03 

17.06

17.09

17.12

17.15

17.18 

17.20 

17.23 

17.26 

17.29 

300 

17.32 

17.35

17.38 

17.41 

17.44 

17.46 

17.49 

17.52

17.55

17.58

310 

17.61 

17.64 

17.66 

17.69

17.72

17.75

17.78 

17.80 

17.83 

17.86 

320 

17.89 

17.92 

17.94 

17.97 

18.00 

18.03 

18.06 

18.08 

18.11 

18.14 

330 

18.17 

18.19 

18.22 

18.25

18.28 

18.30 

18.33 

18.36 

18.38 

18.41 

340 

18.44 

18.47 

18.49 

18.52 

18.55

18.57

18.60 

18.63 

18.65 

18.68 

350

18.71 

18.74 

18.76 

18.79 

18.81 

18.84 

18.87 

18.89 

18.92 

18.95 

360 

18.97 

19.00 

19.03 

19.05 

19.08 

19.10 

19.13 

19.16 

19.18 

19.21 

370 

19.24 

19.26 

19.29

19.31 

19.34 

19.36 

19.39 

19.42 

19.44 

19.47 

380

19.49

19.52

19.54

19.57

19.60 

19.62 

19.65 

19.67 

19.70 

19.72 

390

19.75

19.77

19.80

19.82

19.85

19.87

19.90

19.92

19.95

19.98 

400 

20.00 

20.02 

20.05 

20.07 

20.10 

20.12 

20.15

20.17 

20.20 

20.22 

410 

20.25

20.27 

20.30 

20.32

20.35 

20.37 

20.40 

20.42 

20.44 

20.47 

420 

20.49 

20.52

20.54

20.57

20.59

20.62 

20.64 

20.66 

20.69 

20.71 

430 

20.74 

20.76

20.78 

20.81 

20.83 

20.86 

20.88 

20.90 

20.93 

20.95 

440 

20.98

21.00

21.02

21.05

21.07 

21.10 

21.12 

21.14 

21.17 

21.19 

450

21.21 

21.24 

21.26 

21.28 

21.31 

21.33 

21.35

21.38 

21.40 

21.42 

460 

21.45

21.47

21.49

21.52

21.54

21.56

21.59

21.61 

21.63 

21.66 

470 

21.68 

21.70 

21.73 

21.75

21.77 

21.79 

21.82 

21.84 

21.86

21.89 

480 

21.91 

21.93 

21.95 

21.98 

22.00

22.02

22.05

22.07

22.09

22.11

490 

22.14 

22.16 

22.18 

22.20 

22.23 

22.25

22.27 

22.29 

22.32 

22.34 

500

22.36 

22.38 

22.41 

22.43 

22.45 

22.47 

22.49 

22.52

22.54

22.56

510

22.58

22.61

22.63

22.65

22.67 

22.69 

22.72 

22.74 

22.76 

22.78 

520

22.80

22.83

22.85

22.87 

22.89 

22.91 

22.93 

22.96 

22.98 

23.00 

530

23.02

23.04

23.07 

23.09 

23.11 

23.13 

23.15 

23.17 

23.19 

23.22 

540 

23.24

23.26

23.28

23.30

23.32

23.35

23.37

23.39

23.41

23.43 



¹ Source: Waugh, Albert E., Laboratory Manual and Problems for Elements of Statistical
Method (McGraw-Hill Book Company, Inc., 1944).

















Table IV. — Square Roots of Numbers from 100 to 1000. — (Continued)


N 

0 

1

2

3

4 

5 

6 

7

8 

9 

550 

23.45 

23.47 

23.49 

23.52 

23.54 

23.56 

23.58 

23.60 

23.62 

23.64 

560 

23.66 

23.69 

23.71 

23.73 

23.75 

23.77 

23.79 

23.81 

23.83 

23.85 

570 

23.87 

23.90 

23.92 

23.94 

23.96 

23.98 

24.00 

24.02 

24.04 

24.06 

580 

24.08 

24.10 

24.12 

24.15 

24.17 

24.19 

24.21 

24.23 

24.25 

24.27 

590 

24.29 

24.31 

24.33 

24.35 

24.37 

24.39 

24.41 

24.43 

24.45 

' 24.47 

600 

24.49 

24.52 

24.54 

24.56 

24.58 

24.60 

24.62 

24.64 

24.66 

24.68 

610 

24.70 

24.72 

24.74 

24.76 

24.78 

24.80 

24.82 

24.84 

24.86 

24.88 

620 

24.90 

24.92 

24.94 

24.96

24.98

25.00

25.02

25.04

25.06 

25.08 

630

25.10 

25.12 

25.14 

25.16 

25.18 

25.20

25.22

25.24

25.26 

25.28 

640 

25.30 

25.32 

25.34 

25.36 

25.38 

25.40

25.42

25.44

25.46

25.48 

650 

25.50 

25.51 

25.53 

25.55 

25.57 

25.59

25.61

25.63

25.65

25.67 

660 

25.69 

25.71 

25.73 

25.75 

25.77 

25.79 

25.81 

25.83

25.85 

25.86 

670 

25.88 

25.90 

25.92 

25.94 

25.96

25.98

26.00 

26.02 

26.04 

26.06 

680 

26.08 

26.10 

26.12 

26.13 

26.15 

26.17 

26.19 

26.21 

26.23 

26.25 

690 

26.27 

26.29 

26.31 

26.32 

26.34 

26.36

26.38 

26.40 

26.42 

26.44 

700 

26.46 

26.48 

26.50 

26.51 

26.53

26.55

26.57

26.59 

26.61 

26.63 

710 

26.65 

26.66 

26.68 

26.70 

26.72 

26.74 

26.76

26.78 

26.80 

26.81 

720 

26.83 

26.85 

26.87 

26.89 

26.91 

26.93 

26.94 

26.96 

26.98 

27.00 

730 

27.02 

27.04 

27.06 

27.07 

27.09 

27.11 

27.13 

27.15 

27.17 

27.18 

740 

27.20 

27.22 

27.24 

27.26 

27.28 

27.29 

27.31 

27.33 

27.35

27.37 

750 

27.39 

27.40 

27.42 

27.44 

27.46 

27.48 

27.50 

27.51 

27.53

27.55 

760 

27.57 

27.59 

27.60 

27.62 

27.64 

27.66 

27.68 

27.69 

27.71 

27.73 

770 

27.75 

27.77 

27.78 

27.80 

27.82 

27.84 

27.86 

27.87 

27.89 

27.91 

780 

27.93 

27.95 

27.96 

27.98 

28.00 

28.02 

28.04 

28.05 

28.07 

28.09 

790 

28.11 

28.12 

28.14 

28.16 

28.18 

28.20 

28.21 

28.23 

28.25 

28.27 

800 

28.28 

28.30 

28.32 

28.34 

28.35 

28.37 

28.39 

28.41 

28.43 

28.44 

810 

28.46 

28.48 

28.50 

28.51 

28.53 

28.55

28.57

28.58 

28.60 

28.62 

820 

28.64 

28.65 

28.67 

28.69 

28.71 

28.72 

28.74 

28.76 

28.78 

28.79 

830 

28.81 

28.83 

28.84 

28.86 

28.88 

28.90 

28.91 

28.93 

28.95

28.97 

840 

28.98 

29.00 

29.02 

29.03 

29.05 

29.07 

29.09 

29.10 

29.12

29.14

850 

29.15 

29.17 

29.19 

29.21 

29.22

29.24 

29.26 

29.27 

29.29 

29.31 

860 

29.33 

29.34 

29.36 

29.38 

29.39 

29.41

29.43 

29.44 

29.46 

29.48 

870 

29.50 

29.51 

29.53 

29.55 

29.56

29.58 

29.60 

29.61 

29.63 

29.65 

880 

29.66 

29.68 

29.70 

29.72 

29.73 

29.75

29.77 

29.78 

29.80 

29.82 

890 

29.83 

29.85 

29.87 

29.88 

29.90 

29.92 

29.93 

29.95 

29.97

29.98

900 

30.00 

30.02 

30.03 

30.05 

30.07 

30.08 

30.10 

30.12 

30.13 

30.15 

910 

30.17

30.18 

30.20 

30.22 

30.23 

30.25 

30.27 

30.28 

30.30 

30.32

920 

30.33 

30.35 

30.36 

30.38 

30.40

30.41 

30.43 

30.45 

30.46 

30.48 

930 

30.50 

30.51 

30.53 

30.54 

30.56 

30.58 

30.59 

30.61 

30.63

30.64 

940 

30.66 

30.68 

30.69 

30.71 

30.72 

30.74 

30.76 

30.77 

30.79 

30.81 

950 

30.82 

30.84 

30.85 

30.87 

30.89 

30.90 

30.92 

30.94

30.95 

30.97 

960 

30.98 

31.00 

31.02 

31.03 

31.05 

31.06 

31.08 

31.10 

31.11 

31.13

970 

31.14 

31.16 

31.18 

31.19 

31.21 

31.22 

31.24 

31.26 

31.27 

31.29 

980 

31.30

31.32

31.34 

31.35 

31.37 

31.38 

31.40 

31.42 

31.43 

31.45 

990 

31.46

31.48

31.50

31.51

31.53

31.54

31.56

31.58

31.59

31.61
Table V. — Reciprocals of Numbers¹


N 

.00 

.01

.02 

.03 

.04

.05

.06

.07 

.08 

.09 

1.00 

1.0000 

.9901 

.9804 

.9709 

.9615

.9524 

.9434 

.9346 

.9259

.9174

1.10

.9091 

.9009 

.8929 

.8850 

.8772 

.8696 

.8621 

.8547 

.8475

.8403

1.20 

.8333 

.8264 

.8197 

.8130 

.8065 

.8000 

.7937 

.7874 

.7812

.7752

1.30

.7692

.7634

.7576

.7519

.7463

.7407

.7353

.7299 

.7246 

.7194 

1.40 

.7143 

.7092 

.7042 

.6993 

.6944 

.6897 

.6849 

.6803 

.6757

.6711

1.50

.6667

.6623

.6579

.6536

.6494

.6452

.6410 

.6369 

.6329 

.6289 

1.60 

.6250

.6211

.6173

.6135

.6098

.6061

.6024

.5988

.5952

.5917

1.70 

.5882

.5848

.5814

.5780

.5747

.5714

.5682

.5650

.5618

.5587

1.80 

.5556

.5525

.5495

.5464

.5435

.5405

.5376

.5348

.5319

.5291

1.90 

.5263

.5236

.5208

.5181

.5155

.5128

.5102

.5076

.5051

.5025

2.00

.5000

.4975

.4950

.4926 

.4902 

.4878 

.4854 

.4831 

.4808 

.4785

2.10 

.4762 

.4739 

.4717 

.4694 

.4673 

.4651

.4630 

.4608 

.4587 

.4566

2.20

.4545

.4525 

.4504 

.4484 

.4464 

.4444 

.4425

.4405 

.4386 

.4367 

2.30

.4348 

.4329 

.4310 

.4292 

.4274 

.4255 

.4237 

.4219 

.4202 

.4184 

2.40 

.4167 

.4149 

.4132 

.4115

.4098

.4082

.4065

.4049 

.4032 

.4016 

2.50

.4000 

.3984 

.3968 

.3953

.3937 

.3922 

.3906 

.3891 

.3876 

.3861 

2.60 

.3846 

.3831 

.3817 

.3802 

.3788 

.3774 

.3759 

.3745 

.3731 

.3717 

2.70 

.3704 

.3690 

.3676 

.3663 

.3650

.3636 

.3623 

.3610 

.3597 

.3584

2.80

.3571

.3559

.3546

.3534

.3521

.3509 

.3496 

.3484 

.3472 

.3460 

2.90 

.3448 

.3436 

.3425

.3413 

.3401 

.3390 

.3378 

.3367 

.3356 

.3344 

3.00 

.3333 

.3322 

.3311 

.3300 

.3289 

.3279 

.3268 

.3257

.3247 

.3236 

3.10 

.3226 

.3215

.3205

.3195

.3185

.3175

.3165

.3155

.3145

.3135

3.20 

.3125 

.3115 

.3106 

.3096 

.3086 

.3077 

.3067 

.3058 

.3049 

.3040 

3.30 

.3030 

.3021 

.3012 

.3003

.2994 

.2985 

.2976 

.2967 

.2959 

.2950

3.40 

.2941 

.2933 

.2924 

.2915 

.2907 

.2899 

.2890 

.2882 

.2874 

.2865

3.50

.2857 

.2849 

.2841 

.2833 

.2825 

.2817 

.2809 

.2801 

.2793 

.2786 

3.60

.2778 

.2770 

.2762 

.2755

.2747 

.2740 

.2732 

.2725 

.2717 

.2710 

3.70

.2703 

.2696 

.2688 

.2681 

.2674 

.2667 

.2660 

.2653 

.2646 

.2639 

3.80 

.2632 

.2625

.2618 

.2611 

.2604 

.2597

.2591

.2584

.2577

.2571

3.90 

.2564 

.2558

.2551

.2545

.2538

.2532

.2525

.2519

.2513

.2506

4.00

.2500

.2494

.2488

.2481

.2475

.2469

.2463

.2457

.2451

.2445 

4.10 

.2439 

.2433 

.2427 

.2421 

.2415 

.2410 

.2404 

.2398 

.2392 

.2387 

4.20 

.2381 

.2375

.2370 

.2364 

.2358

.2353 

.2347 

.2342 

.2336 

.2331 

4.30 

.2326 

.2320 

.2315

.2309 

.2304 

.2299 

.2294 

.2288 

.2283 

.2278 

4.40 

.2273 

.2268 

.2262 

.2257 

.2252

.2247

.2242 

.2237 

.2232 

.2227 

4.50

.2222 

.2217 

.2212 

.2208 

.2203 

.2198 

.2193 

.2188 

.2183 

.2179 

4.60 

.2174

.2169 

.2164 

.2160 

.2155 

.2151 

.2146 

.2141 

.2137

.2132 

4.70 

.2128 

.2123 

.2119 

.2114 

.2110 

.2105

.2101 

.2096 

.2092 

.2088

4.80 

.2083 

.2079 

.2075

.2070 

.2066 

.2062 

.2058 

.2053 

.2049 

.2045

4.90 

.2041 

.2037 

.2033 

.2028 

.2024 

.2020 

.2016 

.2012 

.2008 

.2004

5.00

.2000

.1996

.1992 

.1988 

.1984 

.1980 

.1976 

.1972 

.1968 

.1965

5.10 

.1961 

.1957

.1953 

.1949 

.1946 

.1942 

.1938 

.1934 

.1930 

.1927 

5.20

.1923

.1919

.1916 

.1912 

.1908 

.1905

.1901 

.1898 

.1894 

.1890 

5.30

.1887

.1883

.1880

.1876

.1873

.1869

.1866

.1862

.1859

.1855

5.40

.1852

.1848

.1845

.1842

.1838

.1835

.1832

.1828

.1825

.1821


¹ Source: Waugh, Albert E., Laboratory Manual and Problems for Elements of Statistical
Method (McGraw-Hill Book Company, Inc., 1944).
Table V. — Reciprocals of Numbers. — (Continued)


Table VI. — Areas under the Normal Curve
Fractional parts of the total area (1.000) under the normal curve between
the mean and a perpendicular erected at various numbers of standard
deviations (x/σ) from the mean.¹ To illustrate the use of the table, 39.065
per cent of the total area under the curve will lie between the mean and a
perpendicular erected at a distance of 1.23σ from the mean.

Each figure in the body of the table is preceded by a decimal point. 
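
The entries of Table VI can be checked with a short computation. The following is a minimal sketch, not part of the original text: it assumes the standard relation between the normal area and the error function, and the function name `area_mean_to_z` is chosen here only for illustration.

```python
import math


def area_mean_to_z(z: float) -> float:
    """Fraction of the total area under the normal curve lying between
    the mean and a perpendicular erected z standard deviations away.

    Assumes the standard identity: area = 0.5 * erf(z / sqrt(2)).
    """
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))


# The illustration given with the table: a distance of 1.23 sigma.
print(f"{area_mean_to_z(1.23):.5f}")  # compare with the tabled figure 39065
```

Because each tabled figure is preceded by a decimal point, the entry 39065 corresponds to the value 0.39065 returned here, that is, 39.065 per cent of the total area.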


x/σ

.00 

.01 

.02 

.03 

.04 

.05 

.06 

.07 

.08 

.09 

0.0

00000

00399

00798

01197

01595

01994

02392

02790

03188

03586

0.1 

03983 

04380 

04776 

05172 

05567

05962

06356

06749

07142

07535

0.2

07926

08317

08706

09095

09483

09871

10257

10642

11026

11409 

0.3 

11791 

12172 

12552 

12930 

13307 

13683 

14058

14431

14803

15173

0.4

15542

15910

16276

16640

17003

17364

17724

18082

18439

18793 

0.5 

19146 

19497 

19847 

20194 

20540

20884 

21226 

21566

21904

22240

0.6

22575

22907

23237

23565

23891

24215

24537

24857

25175

25490

0.7

25804

26115

26424

26730

27035

27337

27637

27935

28230

28524

0.8 

28814 

29103 

29389 

29673 

29955 

30234 

30511 

30785

31057

31327

0.9

31594

31859

32121 

32381 

32639 

32894 

33147 

33398

33646

33891

1.0 

34134 

34375 

34614 

34850 

35083 

35314

35543 

35769

35993

36214

1.1

36433

36650 

36864 

37076 

37286 

37493 

37698 

37900

38100

38298

1.2

38493 

38686 

38877 

39065 

39251 

39435 

39617 

39796

39973

40147

1.3 

40320 

40490 

40658 

40824 

40988 

41149 

41308 

41466

41621

41774

1.4 

41924 

42073 

42220 

42364 

42507

42647 

42786 

42922 

43056 

43189 

1.5

43319 

43448 

43574 

43699 

43822 

43943 

44062 

44179 

44295

44408 

1.6 

44520 

44630 

44738 

44845 

44950 

45053 

45154

45254 

45352 

45449 

1.7 

45543

45637

45728

45818 

45907 

45994 

46080 

46164 

46246 

46327 

1.8 

46407 

46485 

46562 

46638 

46712 

46784 

46856 

46926 

46995 

47062 

1.9

47128

47193

47257

47320

47381

47441

47500

47558

47615

47670

2.0

47725 

47778 

47831 

47882 

47932 

47982 

48030 

48077 

48124 

48169 

2.1

48214

48257 

48300 

48341 

48382 

48422 

48461 

48500 

48537 

48574 

2.2

48610

48645

48679 

48713 

48745 

48778 

48809 

48840 

48870 

48899

2.3 

48928 

48956

48983 

49010 

49036 

49061 

49086 

49111 

49134 

49158

2.4 

49180 

49202 

49224 

49245 

49266 

49286

49305 

49324 

49343 

49361 

2.5 

49379 

49396 

49413 

49430

49446 

49461 

49477 

49492

49506

49520

2.6 

49534

49547 

49560 

49573 

49585 

49598 

49609 

49621 

49632 

49643 

2.7

49653 

49664 

49674 

49683 

49693 

49702 

49711 

49720 

49728 

49736 

2.8 

49744 

49752 

49760 

49767 

49774 

49781

49788 

49795 

49801 

49807 

2.9

49813

49819

49825

49831

49836

49841

49846

49851

49856

49861

3.0

49865 








3.5

4997674 










4.0 

4999683 










4.5

4999966










5.0

4999997133











¹ This table has been adapted, by permission, from F. C. Kent, "Elements of Statistics"
(McGraw-Hill Book Company, Inc., 1924).
Table VII. — Ordinates of the Normal Curve
Ordinates (heights) of the standard normal curve.¹ The height (y) at
any distance (x) from the mean is

y = 0.39894 e^{-x^2/2}

To make the curve fit a histogram in which the abscissa scale is measured
in original x units instead of standard-deviation (x/σ) units, multiply these
ordinates by Ni/σ where N is the number of cases, i the class interval, and
σ the standard deviation.

Each figure in the body of the table is preceded by a decimal point. 
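
The ordinate formula and the Ni/σ adjustment can be sketched in a few lines. This is an illustrative sketch only, not part of the original text: the function names are hypothetical, and the values of N, i, and σ in the usage lines are invented for the example.

```python
import math


def ordinate(z: float) -> float:
    """Height of the standard normal curve at z = x / sigma.

    Uses 1 / sqrt(2 * pi), which the table rounds to 0.39894.
    """
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def fitted_height(z: float, n_cases: float, class_interval: float,
                  sigma: float) -> float:
    """Ordinate rescaled by N * i / sigma so that the curve fits a
    histogram whose abscissa is measured in original x units."""
    return ordinate(z) * n_cases * class_interval / sigma


# Hypothetical histogram: N = 100 cases, class interval i = 5, sigma = 10.
print(f"{ordinate(1.0):.5f}")                      # tabled figure at 1.0 is 24197
print(f"{fitted_height(0.0, 100, 5, 10):.3f}")     # height of the fitted curve at the mean
```

With these assumed figures the multiplier Ni/σ is 50, so every tabled ordinate is simply scaled by 50 to give the height of the fitted curve in frequency units.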


x/σ

.00 

.01 

.02 

.03 

.04 

.05 

.06 

.07 

.08 

.09 

0.0 

39894 

39892 

39886

39876 

39862 

39844 

39822 

39797 

39767 

39733 

0.1 

39695 

39654 

39608 

39559 

39505 

39448 

39387 

39322 

39253 

39181 

0.2 

39104 

39024 

38940 

38853 

38762 

38667 

38568 

38466 

38361 

38251 

0.3 

38139 

38023 

37903 

37780 

37654 

37524 

37391

37255

37115

36973 

0.4 

36827 

36678 

36526

36371 

36213 

36053 

35889 

35723 

35553 

35381

0.5

35207

35029

34849

34667

34482

34294 

34105 

33912 

33718 

33521 

0.6 

33322 

33121 

32918 

32713 

32506 

32297 

32086 

31874 

31659 

31443 

0.7 

31225 

31006 

30785 

30563 

30339 

30114 

29887 

29658 

29430 

29200 

0.8 

28969 

28737 

28504 

28269 

28034 

27798 

27562 

27324 

27086 

26848 

0.9 

26609 

26369 

26129 

25888 

25647 

25406 

25164 

24923 

24681 

24439

1.0 

24197 

23955 

23713 

23471 

23230 

22988 

22747 

22506 

22265 

22025 

1.1 

21785 

21546 

21307 

21069 

20831 

20594 

20357 

20121 

19886 

19652 

1.2 

19419 

19186 

18954 

18724 

18494 

18265

18037 

17810 

17585 

17360 

1.3 

17137 

16915 

16694 

16474 

16256

16038

15822

15608 

15395 

15183 

1.4 

14973 

14764 

14556 

14350 

14146 

13943 

13742 

13542 

13344 

13147 

1.5 

12952 

12758 

12566 

12376 

12188 

12001 

11816 

11632 

11450 

11270 

1.6 

11092 

10915 

10741 

10567 

10396 

10226 

10059

09893

09728

09566 

1.7 

09405 

09246 

09089 

08933 

08780 

08628 

08478 

08329 

08183 

08038

1.8 

07895 

07754 

07614 

07477 

07341 

07206 

07074 

06943 

06814 

06687 

1.9 

06562 

06438 

06316 

06195 

06077 

05959 

05844

05730

05618

05508 

2.0 

05399 

05292 

05186 

05082 

04980 

04879 

04780 

04682 

04586 

04491 

2.1 

04398 

04307 

04217 

04128 

04041 

03955 

03871 

03788

03706 

03626 

2.2 

03547 

03470 

03394 

03319 

03246 

03174 

03103 

03034 

02965 

02898

2.3 

02833 

02768 

02705

02643 

02582 

02522 

02463 

02406

02349 

02294 

2.4 

02239 

02186 

02134 

02083 

02033 

01984 

01936 

01888 

01842 

01797 

2.5 

01753

01709 

01667 

01625 

01585 

01545 

01506

01468

01431

01394

2.6

01358 

01323 

01289 

01256 

01223 

01191 

01160 

01130 

01100 

01071 

2.7 

01042 

01014 

00987 

00961 

00935

00909

00885 

00861 

00837 

00814 

2.8 

00792 

00770 

00748 

00727 

00707

00687 

00668 

00649 

00631 

00613 

2.9 

00595 

00578

00562

00545 

00530 

00514 

00499 

00485

00470

00457

3.0

00443









3.5

0008727 










4.0 

0001338 










4.5

0000160










5.0

000001487











¹ This table adapted, by permission, from Kent, "Elements of Statistics."

Ordinates may also be computed from the equation log y = 9.600910 − 10 − 0.217147 x^2,
and for ordinates beyond 3σ it would be necessary to use log y = 9.6009100658 − 10 −
0.2171472410 x^2.
Table VIII. — Hyperbolic Tangents¹


z

r = tanh z

z

r = tanh z

z

r = tanh z

0.00 

0.00000 

0.55 

0.50052 

1.10 

0.80050 

0.01

.01000 

0.56 

.50798

1.11 

.80406 

0.02 

.02000 

0.57 

.51536

1.12 

.80757 

0.03

.02999 

0.58 

.52267

1.13 

.81102 

0.04 

.03998 

0.59 

.52990

1.14

.81441

0.05 

0.04996 

0.60 

0.53705 

1.15 

0.81775 

0.06 

.05993 

0.61 

.54413

1.16 

.82104 

0.07 

.06989 

0.62 

.55113 

1.17 

.82427 

0.08 

.07983 

0.63 

.55805

1.18 

.82745 

0.09 

.08976 

0.64

.56490

1.19

.83058

0.10 

0.09967 

0.65 

0.57167 

1.20 

0.83366 

0.11 

.10956

0.66 

.57836 

1.21 

.83668 

0.12 

.11943 

0.67 

.58498

1.22 

.83965 

0.13 

.12927

0.68 

.59152 

1.23 

.84258

0.14

.13909

0.69

.59798

1.24 

.84546 

0.15

0.14889

0.70

0.60437 

1.25 

0.84828 

0.16 

.15865 

0.71

.61068 

1.26 

.85106 

0.17 

.16838

0.72 

.61691 

1.27 

.85380 

0.18 

.17808 

0.73 

.62307 

1.28 

.85648 

0.19 

.18775

0.74 

.62915 

1.29 

.85913 

0.20 

0.19738 

0.75 

0.63515 

1.30 

0.86172 

0.21 

.20697 

0.76 

.64108 

1.31 

.86428 

0.22 

.21652 

0.77 

.64693 

1.32 

.86678 

0.23 

.22603

0.78 

.65271 

1.33 

.86925 

0.24 

.23550 

0.79 

.65841 

1.34 

.87167 

0.25 

0.24492 

0.80 

0.66404 

1.35

0.87405 

0.26 

.25430

0.81 

.66959 

1.36 

.87639

0.27

.26362

0.82 

.67507 

1.37 

.87869

0.28 

.27291 

0.83 

.68048 

1.38 

.88095 

0.29 

.28213 

0.84 

.68581

1.39 

.88317 

0.30 

0.29131 

0.85 

0.69107 

1.40 

0.88535 

0.31 

.30044 

0.86 

.69626 

1.41 

.88749

0.32

.30951 

0.87 

.70137 

1.42 

.88960

0.33

.31852 

0.88 

.70642 

1.43 

.89167 

0.34

.32748 

0.89 

.71139 

1.44 

.89370

0.35

0.33638

0.90 

0.71630 

1.45 

0.89569 

0.36 

.34521 

0.91 

.72113 

1.46 

.89765

0.37

.35399

0.92 

.72590 

1.47 

.89958

0.38 

.36271 

0.93 

.73059

1.48 

.90147 

0.39 

.37136

0.94

.73522 

1.49 

.90332

0.40 

0.37995 

0.95 

0.73978 

1.50 

0.90515 

0.41 

.38847

0.96 

.74428 

1.51 

.90694 

0.42 

.39693

0.97 

.74870 

1.52 

.90870 

0.43

.40532 

0.98 

.75307

1.53 

.91042 

0.44 

.41364 

0.99 

.75736 

1.54 

.91212 

0.45

0.42190 

1.00 

0.76159 

1.55 

0.91379 

0.46 

.43008

1.01 

.76576 

1.56 

.91542

0.47 

.43820 

1.02 

.76987 

1.57 

.91703

0.48 

.44624 

1.03 

.77391 

1.58 

.91860 

0.49 

.45422 

1.04 

.77789 

1.59 

.92015 

0.50 

0.46212 

1.05 

0.78181 

1.60 

0.92167 

0.51 

.46995 

1.06 

.78566 

1.61 

.92316 

0.52 

.47770

1.07 

.78946 

1.62 

.92462 

0.53 

.48538 

1.08 

.79320 

1.63 

.92606 

0.54 

.49299 

1.09 

.79688

1.64

.92747


¹ Source: Hodgman, Charles D., Mathematical Tables from Handbook of Chemistry and
Physics (1941).
Table VIII. — Hyperbolic Tangents. — (Continued)


z 

r = tanh z

z 

r = tanh z

z 

r = tanh z

1.65 

0.92886 

2.20 

0.97574 

2.75

0.99186 

1.66 

.93022 

2.21 

.97622 

2.76 

.99202 

1.67 

.93155 

2.22 

.97668 

2.77 

.99218 

1.68 

.93286 

2.23 

.97714 

2.78 

.99233 

1.69 

.93415 

2.24 

.97759 

2.79 

.99248 

1.70 

0.93541 

2.25 

0.97803

2.80 

0.99263 

1.71 

.93665 

2.26 

.97846 

2.81 

.99278 

1.72 

.93786 

2.27 

.97888 

2.82 

.99292 

1.73 

.93906 

2.28 

.97929 

2.83 

.99306 

1.74 

.94023 

2.29 

.97970 

2.84 

.99320 

1.75 

0.94138 

2.30 

0.98010 

2.85 

0.99333 

1.76 

.94250 

2.31 

.98049 

2.86 

.99346 

1.77 

.94361 

2.32 

.98087 

2.87 

.99359 

1.78 

.94470 

2.33 

.98124 

2.88 

.99372 

1.79 

.94576 

2.34 

.98161 

2.89 

.99384 

1.80 

0.94681 

2.35 

0.98197 

2.90 

0.99396

1.81 

.94783 

2.36

.98233

2.91 

.99408 

1.82 

.94884 

2.37 

.98267 

2.92 

.99420 

1.83 

.94983 

2.38 

.98301 

2.93 

.99431 

1.84 

.95080 

2.39 

.98335 

2.94 

.99443 

1.85

0.95175 

2.40 

0.98367 

2.95 

0.99454 

1.86 

.95268 

2.41 

.98400 

2.96 

.99464 

1.87 

.95359 

2.42 

.98431

2.97 

.99475 

1.88 

.95449 

2.43 

.98462

2.98 

.99485 

1.89 

.95537 

2.44 

.98492 

2.99 

.99496 

1.90 

0.95624 

2.45 

0.98522 

3.0 

0.99505 

1.91 

.95709 

2.46 

.98551 

3.1 

.99595

1.92

.95792

2.47 

.98579 

3.2 

.99668

1.93

.95873

2.48 

.98607 

3.3 

.99728 

1.94 

.95953 

2.49 

.98635 

3.4 

.99777 

1.95 

0.96032 

2.50 

0.98661 

3.5 

0.99818 

1.96 

.96109 

2.51 

.98688 

3.6 

.99851 

1.97 

.96185 

2.52 

.98714 

3.7 

.99878

1.98 

.96259 

2.53 

.98739 

3.8 

.99900 

1.99 

.96331

2.54 

.98764 

3.9 

.99918 

2.00 

0.96403 

2.55 

0.98788 

4.0

0.99933 

2.01

.96473 

2.56 

.98812 

4.1 

.99945 

2.02 

.96541 

2.57 

.98835

4.2 

.99955 

2.03 

.96609 

2.58

.98858 

4.3 

.99963

2.04 

.96675 

2.59 

.98881

4.4 

.99970 

2.05 

0.96740 

2.60 

0.98903 

4.5

0.99975 

2.06 

.96803 

2.61 

.98924 

4.6

.99980

2.07 

.96865 

2.62 

.98946 

4.7 

.99983

2.08 

.96926 

2.63

.98966 

4.8 

.99986 

2.09 

.96986 

2.64 

.98987 

4.9 

.99989 

2.10 

0.97045 

2.65 

0.99007 

5.0 

0.99991

2.11 

.97103 

2.66 

.99026 



2.12

.97159 

2.67 

.99045 



2.13

.97215 

2.68 

.99064 



2.14 

.97269 

2.69 

.99083 



2.15 

0.97323 

2.70 

0.99101 



2.16 

.97375 

2.71 

.99118 



2.17 

.97426 

2.72 

.99136 



2.18 

.97477 

2.73 

.99153 



2.19 

.97526 

2.74 

.99170 



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A 

Accuracy, in calculating statistics, 
230-231 

Agricultural Situation, 80 
Agricultural Statistics, 80 

U.S. Department of Agriculture,
80 

Bureau of Agricultural Eco- 
nomics, 80-81 

Agricultural Situation, 80
Agricultural Yearbook, 80 
Crops and Markets, 80 
American Bankers Association, 51 
Analysis of variance, in multiple
correlation, 422-429

in nonlinear correlation, correla- 
tion index, 395-396 
correlation ratio, 373-376 
in simple correlation, 352-353 
Annalist, 534 

Arithmetic charts, 120-131 
Array, 139-140
Asymmetry (see Skewness)
Attributes, variable, 157
Averages (see Frequency distributions,
averages)

Avogadro's law, 57 


B 

Banking statistics, sources of, 79- 
86 

Federal Reserve, Board of Gover- 
nors, 82 

Federal Reserve Bulletin, 82 
Member Bank Call Report, 82
National Monetary Commission, 

• i§ S3 * 

Statesman's Yearbook, 86 


Banking statistics, U.S. Treasury 
Department, 79 

Abstract of Condition of National 
Banks, 79 

Bar charts, 104-105 
Bayes, T., 242 
Bernoulli, Daniel, 242 
Bernoulli, Jacques, 242 
Beta coefficient, 192-193 
Beta cross-product term, 425 
Biennial Census of Manufactures, 
53? 

Binomial distribution, symmetrical 
(see Symmetrical binomial
distribution)

Bivariate frequency distribution, 
325-353 

first-order standard deviation, 
relation to r, 351-353 
illustration of, 325-327 
table, 326 

joint variation illustrated (bivari- 
ate scatter diagrams), 339, 
343, 345 

methods of summarization and 
comparison, 327-353 
Pearsonian coefficient of
correlation, 338-349

analysis of variance, 352-353 
calculation of, 347, 349 
progressions of means, 328-329 
illustrated (graphs), 328-329 
Bivariate frequency surface, 471-486 
bivariate histogram, 469-471 
illustrated (three-dimensional 
diagram), 470 
independent variables, 471 
lines, of regression, 486-488 
mathematical representation, 
487-488

nonnormal, 491-492
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• Bivariate frequency surface, non- 
normal, product-moment for- 
mula for r, cases for use or 
nonuse, 491-492

normal, dependent variables, 477- 
486 

derivation of equation, 481-486, 
492-496 

equation of rotated ellipse, 481- 
482 

horizontal cross section, 481 
horizontal view (graph), 477 
illustrated (graph), 480 
mathematical representation, 
482-486 

rotation and narrowing with 
correlation, 483-484
vertical view, 478 
normal, independent variables,
472-476

circular form with equal stand- 
ard deviations, 475-476 
illustrated, 472 
normal curve from which de- 
rived (graph), 473 
elliptical form with unequal 
standard deviations, 476 
horizontal section, 476 
illustrated (graph), 475 
mathematical representation, 

473-476

Bivariate histogram, 469-471 
illustrated (three-dimensional dia- 
gram), 470 

Bivariate scatter diagram, 365 
Bivariate series, 149-154 
Boltzmann, L., 19 
Boscovich, R. G., 242

Boyle's law, 19, 652 
Bradstreet's index, 522

Bureau of Foreign and Domestic
Commerce, 56, 76-77, 512, 535
Bureau of Home Economics, 48 
Bureau of Labor Statistics, 7, 42-50, 
54, 500, 535, 538 
indexes, 517, 525, 527 
Bureau of Mines, 79

Business barometers (see Indexes)


C 




Cartograms, 112-121 
by bars, 113-119

by colors and shades, 121 
by cross-hatching, 116-117, 121 
by dots or points, 112-115, 117- 
118, 120-121 
Charles' law, 57 
Charlier check, 209, 354-355 
Charts, 100-121
arithmetic, 129-131
bar, 104-105
bivariate, 150-154 
component-bar, 106-107 
cross-hatched zone, 107-109 
of frequency distributions, 143- 
149

frequency polygon, 143-147 
histogram, 147 
on a ratio scale, 147-149 
pictogram, 102-103 
ratio, 131-137 

logarithms in relation to, 133- 
137 


sectors of circles, 104, 109-112 
split-bar, 110-112 
of time series, 128-138 
time series in relatives, 130 
types of, 101 

Chi square (χ²) curve, 300

Chi square (χ²) test of goodness of
fit, 300-305
the χ² curve, 300
critical values for χ² (table), 304
weaknesses of test, 305
Class interval, 144, 164 
Classical concept of probability,
242-247


Coefficient, confidence, 311-312 
moment, 180 

of multiple correlation, 398, 416- 
418 

of partial correlation, 418-422
of risk, 310-311 
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Coefficient of correlation, arithmeti- 
cal view of, 339-347 
Charlier check, 354-355
computation from grouped data,
357-362 

short method, 357, 359-362 
tabulation of given data (table), 
356 

computation from ungrouped 
data, 354-357 
work sheet (table), 355
distinguished from correlation 
ratio, 365 
first-order, 422 
order of, 422 
Pearsonian, 339 

relationship to line of regression, 
349-351

second-order, 422

third-order, 422 
zero-order, 422 
Combinations, 233-236 
binomial expansion in, 234-236 
defined and illustrated, 233-234 
Combinatorial analysis, problem in, 
270-283 

Commercial and Financial Chronicle,
70

Commercial statistics (see Sources
of statistical data, commercial)
Commodity Yearbook, 71
Component-bar charts, 106-107
Confidence coefficient, 311-312
Confidence interval, 313
Consumers' Incomes in the United
States, 48

Correlation, applications of, by 
social scientists, 324-325 
best way of studying, 365
bivariate frequency table, 357 
coefficient of, Pearsonian, 339 
zero-order, first-order, second- 
order, etc., 422 

multiple (see Multiple correlation) 
nonlinear, 365-396 

(See also Curvilinear regression)


Correlation, origin and development 
of measurement of, 322-324 
partial (see Partial correlation) 
progress in discovery of, 321-322 
ratio, 365-376 
simple, 321-364 

Correlation coefficient (see Coeffi- 
cient of correlation) 

Correlation index, 394-396 
Correlation ratio, calculation of, 
368-373 

explained, 365-368 
Cournot, A., 12 
Covariance, 405 
Coxe, Tench, 72 
Curve, error, 294-295 
frequency, theoretical significance 
of, 162-166 

Gaussian error, 194, 294-295 
growth vs. frequency, 149 
normal, characteristics of, 265 
formula for, 263-267 
method of fitting to sample 
histogram, 299-300 
normal frequency, 232-320 
characteristics of, 265 
formula for, 263-267 

(See also Normal frequency
curve) 

probability, 254 
of regression, 367 
standard normal, characteristic of, 
266-267 

formula for, 267 

Curve fitting, curvilinear
regressions, 376-397

fitting normal curve, 299-300 
fitting trends to time series, 564-
616 

Curvilinear regression, calculation 
of, 376-394

correlation index, 394-396

and analysis of variance, 395-396 
correlation ratio, and analysis of 
variance, 373-376 
calculation of, 368-373 
work sheet (table), 371
explained, 365-368 
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Curvilinear regression^ estimates 
based on regression equations, 
388-390

illustrated, by bivariate scatter 
diagram and fitted curve, 377 
relationship in logarithmic form 
(graph), 378 

relationship in reciprocal form 
(graph), 381 
logarithmic, 377-380 
illustrated (graph), 379 
practical estimates based on 
equation derived, 388-390 
standard error of estimate, cal- 
culated, 390-394 
transformation of problem into 
simple linear correlation, 379- 
380

parabolic, 383-388

curve fitted directly, 384 
Doolittle method for solving 
three equations, 384-388 
work sheet (table), 386 
practical estimates based on 
equation derived, 388-390 
standard error of estimate,
calculated, 390-394
reciprocal, 381-383 
illustrated (graph), 381 
practical estimates based on 
equation derived, 388-390 
standard error of estimate, cal- 
culated, 390-394 
transformation to simple linear 
correlation, 351-353
standard error of estimate, 390-
394 

calculated, 391-393

differences for three types of 
regression, 393-394 
first-order standard deviation 
used as, 390-391

summarized with practical esti- 
mates (table), 393 
use of Pearsonian coefficient of 
correlation, in logarithmic 
approach, 379
in reciprocal approach, 382


Cycle determination, 637-650 
in annual data, 637-642 
annual trend analysis (table 
and graph), 640
cyclical movements shown 
(table), 641 

major cycle and cycle with
residuals (graph), 642
danger in extrapolating trends,
648 

major cycle, 641-642 
method of ratios vs. method of 
differences, 648-650 
in monthly data, adjustment re- 
quired, 637, 642-644 
danger in extrapolating trends, 
648 

method of determining cycle
illustrated, 644-647 
where trend is a second- or third- 
degree polynomial, 647-648 
work sheet (table), 643 
Cycles, 545, 637-650 
analysis by empirical trends, 594- 
598 

ogive-like, 552 

D 

Data, cumulative vs. noncumulative, 
127-128 

gathering of, 24-51

construction of questionnaires 
or schedules, 30-42
rational basis for, 28-29
sampling, 42-49
units of description and
measurement, 28-48
(See also Questionnaires; 
Schedules) 

sources of (see Sources of statisti- 
cal data) 

three types of statistical, 4-6 
De Moivre, A., 242 
Density function, 489
in description of multivariate
distributions, 488-490
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Determination of normality (see 
Normality) 

Dewey, John, ^2 

Directory of Federal Statistical
Agencies, 64 

Distribution, of frequency (see Fre- 
quency distributions) 
of probability (see Probability 
distributions) 

symmetrical binomial (see Symmetrical
binomial distribution)

Domesday Book, 25 
Doolittle work sheet for curvilinear 
correlation, 384, 386-388 
Doolittle work sheet for curvilinear 
regression, work sheet (table), 
386

E 

Eddington, Sir Arthur Stanley, 
18-19, 21-22 
Einstein, Albert, 21 
Empirical trends, 582-598 
analysis of cycles by, 594-598 
straight-line and third-degree 
trends with raw data, illus- 
trated (graph), 597 
conclusions from trends derived,
598

work sheet for trend and index
of normal (table), 594
work sheet for trend values,
method of finite differences
(table), 597

finite differences method for trend
values, 589-594

aid for computing finite differences
at t = 0 (table), 590
building up a polynomial 
(table), 589 

danger of cumulative error, 
593-594

maximum cumulated errors 
(table), 593

work sheet (table), 592
polynomial, 583-594


Empirical trends, polynomial, econ- 
omy of calculation in work 
sheet, 583-589 

economical work sheet, alge- 
braic illustration (table), 

585 

economical work sheet, arith- 
metical illustration (table), 

586 

work sheet for second-degree 
polynomial (table), 588 
straight-line trend, 582-583 
work sheet for index of normal 
and trend, 582 

Enumeration, districts, U.S. census, 
35 

problems of, 28-42 
Enumerators, directions to, 35-39, 
42, 44

training of, 30, 42, 44 
typical problems facing, 29
Equiprobability, ellipsoids of, 489- 
490 

Error curve, 294-295 
Error, standard, of estimate, 383 
for statistics where sampling 
distribution approximates 
normal curve (table), 320 
Estadistica, 90 

Estimates, manufacturers, 7 
Euler, L., 242 

Extrapolation of trends, danger in, 
648 

F 


Federal Reserve Bank of New York, 
517 

Monthly Review of Credit and 
Business Conditions, 534
Federal Reserve Board, 500, 532
index, 533

Federal Reserve Bulletin, 86, 531-532 
Federal Reserve System, 512 
Federal statistical agencies, 71-84 
(See also Sources of statistical 
data) 

Financial statistics, sources of (see 
Sources of statistical data,
financial)
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Finite differences, method of finding 
trend values, 589-594
danger of cumulative error,
593-594

First-order standard deviation, de- 
finition, 390 
Fisher, Irving, 518, 530 
Forecasting, 651-680 
agencies, 661-662 
Babson, 661, 671-672 
Brookmire Economic Society, 
661, 671-672 

Harvard Economic Society, 
661, 671-673 

Moody’s Investor’s Service, 
661, 671-672 

Standard & Poor's Corporation,
661, 671-672 

ancient origin of pseudo-scientific,
651 

combined seasonal and cyclical, 
677-678 

illustrations of, 678 
commercial uses of, 661-662 
cycles with time series, 661-675 
general business conditions, 
663-673

business barometer, 664, 666, 
670 

combination indexes, 666- 
667, 670 

crosscut analysis method, 
664-665, 671-673
historical analogy method, 
663-671 

indexes of national-income, 
667-670 

indexes of physical volume of
production (Babson index),
670-671

lead-lag difficulties in
forecasting, 670

types of indexes, 664-673
particular lines of activity, 673-
675

crosscut analysis method, 675
crude historical analogy
method, illustrated, 673


Forecasting, cycles with time series, 
particular lines of activity, 
cycle hypothesis for, 674-675
lead-lag relationships, 674 
from distribution studies, 656-661 
bivariate distributions, 657-658 
errors of forecasts, 659
monovariate distributions, 656- 
657 

multivariate distributions, 658- 
659 

modern scientific, 652-656 
conditional, 653-654 
illustrations of, 654-656 
popular dramatization of fore- 
casts, 652-653 

qualitative vs. quantitative, 654 
use of statistics in, 656

quality and effect of economic 
forecasting, 679-680 
with seasonal variation, 675-678 
historical analogy, 676-677 
trends with time series, less exact 
forecasting, 660-661 
more exact forecasting, 659-660 
Foreign trade statistics, sources of, 77 
Fourier's theorem, 561 
Fréchet, Maurice, 250
Frequency concept of probability, 
247-249 

Frequency curves, definition, 163-164
derivation from histograms, 162- 

164 

formulas for, 263-267 
normal (see Normal frequency 
curve)

uses of, 164-166 

in graduating observed data, 164- 

165 

as a norm, 165

in sampling analysis, 165-166 
Frequency distribution analysis, 
numerical computation, 199- 
231 

arithmetic mean, rule for, 214 
averages and variability, difficulties
in locating median and
mode, 216-217
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Frequency distribution analysis, 
numerical computation, beta 
coefficients, 216 
calculations, 216-227 
average deviation, 220-224 
histogram assumption in 
grouped data, 221-224 
mid-value assumption in 
grouped data, 220-221 
averages and variability, diffi- 
culties in locating median 
and mode, 216-217 
coefficients, of skewness, 220- 
227 

of variability, 225-226 
measures of skewness, 225 
median and quartiles, 210-220
mode, 217-218 
semiquartile range, 224-225 
construction of class interval, 
199-206 

determining the class inter- 
val, effect of too many 
intervals, 202 

interval size chosen to re- 
veal character of varia- 
tion, 202-207 

illustrative material, distribu- 
tion with various class inter- 
vals (tables), 203-205 
mean square deviation, 215 
moments about the arbitrary 
origin, 207, 211-212 
moments about the arithmetic 
mean, 212-214

scatter diagram and graph, 201
standard deviation, 215-216 
with unequal class intervals, 
228-230 

Variability and skewness, graph- 
ic interpretation of, 227-228 
work sheet, 206-216 
Charlier check for, 209 
entering the distribution, 208 
illustrated (table), 210 
saving calculation, by obtaining
moments about an
arbitrary origin, 207


Frequency distribution analysis, 
numerical computation, work
sheet, saving calculation,
in use of work sheet, 208-
209

by using class-interval units, 
207-208 

theory, 158-198 
averages, 167-180 
arithmetic mean, 167-170 
concept of, as summary fig- 
ures, 179-180 
geometric mean, 173-176 
harmonic mean, 176-179 
median, definition, 170-172 
theory, mode, definition, 172- 
173 

basic formulas used in, summary
of, 198

beta coefficients, 192-193, 195 
bivariate (see Bivariate fre- 
quency distribution) 
charts of, 162-164 
histograms, 162-163 
area histograms, 162 
relative frequencies in, 162- 
163 

determination of normality of, 
297-306 

frequency curves (see Frequency 
curves) 

kurtosis of, 193-196 
measurements of summarization 
and comparison, 166-182
measures of variability, average
deviation, 183-184
quartiles of, 171-172
range, 182-183

standard deviation, 184-185
variance, 166, 185
moments of, 180-182

the centroid, 181 
moment coefficient, 180 
purpose of, 181-182 
multivariate (see Multivariate 
frequency distribution) 
of populations, 166-167
parameters of, 167
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Frequency distribution analysis, 
theory, possible types of com- 
parison, 16^162 
as probability distributions, 254 
of sample data, 167 
statistics of, 167 
sampling distributions (see
Sampling distributions)
skewness of, 185-193 
symbols used in, summary, 197 
trivariate (see Multivariate fre- 
quency distribution, tri- 

variate) 
use of, 158-159 

Frequency distributions, 140-149 
conventional manner of graphing, 
143-144 

discrete vs. continuous, 142-143 
irrational, 156 

nature and illustration of, 140-142 
rational, 155-157 

Frequency polygon, relative slope at 
a given point computed (graph), 
288 

Frequency series, 138-143 
definition, 138-139 

(See also Frequency distribution)

Frequency surfaces, 469-492 
bivariate, 471-486 
bivariate histogram, 469-471 
multivariate (density function), 
488-491 

Frequency table, 143 

Functions, compound-interest, 261- 
262 

discount, 263 
explicit, 255-256 
exponential, declining, 262-263
rising, 261-262 

functional relationships, 255-257 
hyperbolic (table), 695-696
implicit, 255-256, 259-260
joint, 255 
linear, 256 
nonlinear, 256-257 
simple, 257-263 

(See also Simple functions,
graphs of)


G 

Galton, Sir Francis, 14, 196, 293,
323-324 

Gaussian error curve, 194, 294-295 
Geological Survey, 533 
Gompertz logistic curve, 554
Goodness of fit, described, 287 
illustrated (graph), 288 
test of, 300-305 

[See also Chi square (χ²) test
of goodness of fit]

Government Publications and Their
Use, 65

Graphs, (see also Charts)
of simple functions, 257-267

(See also Simple functions,
graphs of)

Graunt, John, 65-66 
Growth, curves, 149 

(See also Rational trends)
explanation of, 553 
Guides to sources, 62-65 
governmental, 64-65 

Directory of Federal Statistical 
Agencies, 64 

Government Publications and
Their Use, 65

U.S. Government Manual, 64
U.S. Government Publications,
65

nongovernmental statistics, 62-64
handbooks and general index 
material, 63-64 
magazine indexes, 62-63 
Guilds, early sources of statistics, 
25 

H 

Handbooks, 57 

Harvard College Observatory, 13 
Harvard index, 665-666 
Heisenberg, W., 19-20
Heisenberg's uncertainty measurement,
19-20

Histograms, in frequency distributions,
162-163
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Hollerith, Herman, 65 

Hollerith tabulating machines, 65, 
72-73 

Huygens, C., 242

Hyperbolic functions (table), 695- 
696 

Hyperplane of regression, 490-491 
I

Index to Business Indexes, 63

Index chart, capital formation, 669 
consumer spending, 668 
Harvard, 665 

Indexes, adjustment to bench marks, 
535-542 

ideal conditions for stratified 
sampling absent, 535-536 
method of adjustment illus- 
trated, 538-542 

monthly indexes adjusted to 
census figures (table), 540- 
541 

reasons for adjustment, 536-538 
of correlation (see Correlation
index)

of general business conditions, 
533-535 

Harvard, 665-666 
method of computation
illustrated, 528

price indexes, aggregative, using
given-year weights, 529
of production, 82, 530-531
quantity indexes, and business 
barometers, 530-535 
stratified sampling in, 530 
of trade and production, 530- 
531 

computation of weights illus- 
trated, 532-533 
weighted by prices, 529-530
relative series from time series
(chart), 130

U.S. Bureau of Labor Statistics,
construction of indexes, 525-
527


Index numbers, 497-542 
application of sampling technique, 
513-514 

stratified sampling, 515-516 
composite, 512-513
stratified sampling in construc- 
tion of, 513 

construction of, aggregative 
method, simple, 522-524 
weighted, 524-525
average-of-relatives method, 
simple, 518-520 
weighted, 520-522 
methods in general, 518 
conversion of absolute to relative 
numbers, 500-511 
absolutes, 500-501 
relative parts of a whole, 509-
511

relatives, 501-503 
relatives using a base period in 
time series, 503-505 
presumption of normality in 
base selected, 505-509 
history of discovery and use, 497- 
500 

simple, great variety in use, 511- 
512 

variety of purposes of, 516-518 
Industrial statistics, sources of, 66, 
77, 83 

Bureau of Manufactures, 77 
The Economic Almanac, 67
Industrial Commission, 83
I.Q., 10

International Statistical Yearbook, 85
Intuitive-axiomatic approach to
probability, 250-251
Iron Age, 68

J

Journal of the American Statistical
Association, 534

K 

King, W. I., 516

Kolmogoroff, A., 250 
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Kurtosis, 162, 196-196 
Kuznets, Simon S., law of growth,
explanation of, 554-558

L 

Labor statistics, sources of, 61, 74-
76, 84 

Commission on Industrial Rela- 
tions, 84 

Department of Labor, 61 
Bureau of Labor Statistics, 78, 
86 

Monthly Labor Review, 78
Lagrange, J. L., 242 
Lambert, J. H., 242 
Laplace, P. S., 242-243, 250 
Law of large numbers, 239-240 
League of Nations, 88-91 
indexes, 512

publications, 512

Least squares, method of, to find
line of regression in bivariate 
frequency distribution, 331-334 
Legendre, A. M., 242 
Line of regression, 329-335 
becomes hyperplane of regression 
in multivariate distribution,
490-491 

derived by method of least 
squares, 331-334 
interpretation of, 336-338 
means of rows and columns 
(table), 330 

relationship to r, 349-351 
standard deviation about means
or line of regression, 336-338 
standard deviations for columns 
of data given (table), 337
of X₁ on X₂, 330-334
illustrative diagram, 332
of X₂ on X₁, 336
illustrative diagram, 334
Linear plane of regression, second-
order variances for, 413-416
Lines of regression, in bivariate frequency
distribution, calculated
from given data, after computing
r, 363-363


lines of regression, work of compu- 
tation in fitting to time series 
when more than two coeffi- 
cients, 599

Loci of equiprobability, in multivariate
frequency "surface," 489
Logarithmic charts (see Ratio charts)
Logarithmic regression, 377-380
Logarithms, of numbers, four-place
common (table), 681-684
scale for ratio charts, 131-137
Logistic growth curves, (see Rational 
trends) 

M 

Magazine indexes, 62-63
Market Research Series, 63
Maximum likelihood, method for
single best estimate of popula- 
tion percentage in sampling, 
314-315 

Means, progressions of, graph, 328- 
329 

Measurement of General Exchange- 
Value, The, 500
Minerals Yearbook, 58 
Mises, Richard von, 245-251, 269 
Mitchell, Wesley C., 500, 514, 530, 
553 

Business Cycles: The Problem and
Its Setting, 499, 513, 535
law of growth, explanation of, 553
Monthly Bulletin of Statistics, 85
Monthly Labor Review, 78
Multiple correlation, 397-436
analysis of variance in, 422-429
coefficient of direct
determination, 424

illustrated (diagram), 424 
coefficient of joint determina- 
tion, 426 

coefficient of multiple
correlation, 426-428

coefficient of net regression,
424-425

r-beta cross-product term, 424-
426 
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Multiple correlation, analysis of vari- 
ance in, r beta cross-product 
term, illustrated (diagram),
424 

residual variance, illustrated 
(diagram), 424 

analysis of variance and causal 
relationships, 428-429
coefficient of, 398, 416-418
definition, 397-399
extended to any number of variables,
434-436

extension of formulas, high-
order variances, 436

multiple-correlation formulas,
436

partial-correlation formulas,
436

statistics for regression
planes, 435-436
general approaches, 434-436
extension of analysis to four 
variables, 429-434 
multiple correlation coefficient, 
434 

partial correlation in four-vari- 
able case, 433-434 
in terms of correlation statistics 
of same order, 432-433 
in terms of lower-order correlation
statistics of same kind,
431-432

in terms of lower-order r's and
σ's, 432

third-order variance, 434

linear vs. nonlinear relationships,
399-400

notation used in, 401-404

meaning of subscripts before
and after point, 402
symmetry of, 404 
partial correlation (see Partial
correlation)

Multiple linear regression, 410-416
beta form of regression equation,
410-413

obtained by method of least
squares, 410


Multiple linear regression, beta form 
of regression equation, a’s and 
b's calculated from beta form, 
412-413 

second-order variances for linear 
plane of regression, 413-416 
Multivariate frequency distribution, 
404-410 

analysis, illustrated, 437-468 
trivariate statistics, interpreta- 
tion of results, illustrated, 
analysis of variance in X, 
451, 455 

estimates based on regression 
equation, 450-451 
partial-correlation coefficients,
451

best approaches in studying, 409-
410

calculation of trivariate statistics,
444-450

all-round check on, 450

equations of three planes of 
regression, as found, 448-449 
first-order correlation statistics, 
445-450 
a statistics, 448 
b statistics from the beta's, 448
coefficients of partial correla- 
tion, 447-448 

first-order beta's from zero- 
order r's (table), 446 
interpretation of results, illus- 
trated, 450-451, 455 
multiple-correlation coefficients, 
449-450 

second-order standard devia- 

tions, 449

zero-order correlation statistics,
444-445

examination of, 437, 442-444
by testing net regression, 437,
442-443

trivariate, 406-410

conditions for independence of
all variables, 408-409
illustrated (diagram), 405 
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' Multivariate frequency distribution^ 
trivariate, studied by breaking 
up into bivariate distributions, 
406-408 

trivariate analysis illustrated, cor- 
relation tables of, 406-407 
correlation table of, Xi and Xi, 
406 

X 2 and Xi, 406 
Xi and Xi, 407 

Multivariate frequency surface, non- 
normal distribution, 491-492 
normal, 488-491 
deviations normally distributed, 
490 

effect of greater correlation on 
ellipsoid shape, 490 
ellipsoids of equiprobability, 
480-491

illustrated (graph), 490 
in reality a density function, 489 

N 

National Bureau of Economic Re- 
search, 67 

National Industrial Conference 
Board, 66 

National Research Planning Board, 
publications, 84 

National Resources Committee, 48 
New Jersey State Labor Dept., 538 
New York Times, The, 534 
Newtonian mechanics, 19-21 
Neyman, J., 236, 250

Nonlinear correlation, 365-396

(See also Curvilinear regression)
Nonnormality, in bivariate or multi- 
variate distributions, 491-492 
of population in sampling, 316 
Normal frequency curve, 232-320 
algebraic and graphic representation
of, 263-267
algebraic formula, 264
graph, 263

graphs of curves with different
means and same standard
deviations, 265


Normal frequency curve, algebraic 
and graphic representation of, 
graphs of curves with same 
mean and different standard 
deviations, 266 
areas under (table), 693 
fitted to histogram of given data
(graph), 295

method of fitting to sample histo-
gram, 299-300
ordinates of (table), 694 
real life conditions producing, 
290-297 

recurrence in statistical analysis, 
264 

standard normal curve (graph), 
267 

and symmetrical binomial
distribution, 279-306
use in theory of sampling, 264 
useful approximation to binomial 
distribution where N is large, 
308 

Normal frequency surface, 469-496 
(See also Bivariate frequency 
surface; Multivariate fre- 
quency surface) 

Normal probability curve, (see Normal
frequency curve)

Normality, determination of, in bi-
nomial distribution, 297-306
by comparison of special statis-
tics, 305-306

in frequency distributions, 297-
306

by graphic comparison, 298-300
fitting normal curve to sample
histogram, method of,
299-300

by test for goodness of fit (see
Goodness of fit, test of)
in time-series, indexes, 505-509 
of population, in sampling, 316- 
317 


O 

Order, of correlation coefficient, 422
of correlation statistics, 422
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Order, designation in correlation sta- 
tistics, indicates combination of 
variables, 422
of regression statistics, 422 
of standard deviations, 422 
of variance, 363-364, 414-416 
Orthogonal-polynomial trends, 599- 
616 

calculation of coefficients A, B, 
C, . . . , 606-607 
by subtotal summation type of
work sheet, 608-612 
orthogonal polynomials, defini- 
tion, 600-601 

forms used in fitting trends, 
603-606 

tables to save calculation, 612-615 
values of specified variables, 
dependent on number of
years (tables), 613-615 
trend line by method of least 
squares, 601-603 
uses in trend analysis, 599-600 

P 

Parabolic regression, 383-388 
Parameter, definition of, 167 
Partial correlation, coefficient of, 
418-422 

^ calculation, 420-422 
definition, 400-401 
notation used in, 401-404
obtained between two variables 
by holding third variable con-
stant, 419-420

Pascal, B., 242 

Pearl-Reed population curve, 549 
Pearson, Karl, 66, 293-294, 323- 
325, 339 

Pearson-Galton apparatus for bi- 
nomial distribution, 293-294 
illustrated, 293 

Pearsonian coefficient of correlation, 
338-349

arithmetic view of r, 339-347 
Pepin the Short, 24 
Percentage, population percentage,
313-315


Periodogram, 561-562 
Permutations, defined and illus- 
trated, 232-233 

Persons, Warren M., 66, 530, 623 
Petty, Sir William, 65-66 
Pictograms, 102-103 
Planck, Max, 19 
Playfair, William, 100-101 
Polling agencies, 6 
Polynomials, definition, 256 
first-degree, graph of, 257-258 
implicit, 255, 259-260 
second-degree, graph of, 258-259 
Population, curves, 548-549 
laws of growth, 549-550
technical term in frequency dis- 
tribution, 166 
theories, early, 549-550 
Prescott, Raymond B., 554 
Presentation of statistics, 92-121 
cartograms, 112-121 

(See also Cartograms)
charts, 100-121 

(See also Charts)
tables, 92-100 

Probability, combinations, 233-236 
concepts of, 236 
classical, 242-243 
criticism of classical concept, 
243-247 

meaning of “equally likely,”
244

principle of indifference, 244-
245

principle of sufficient reason, 
245-246 

subjective character of, 246-
247

frequency concept, 247 

criticism of von Mises’
theory, 247-250 
intuitive-axiomatic approach, 
250-251 

curve, formulas for, 263-267 

(See also Frequency curves;
Normal frequency curve) 
definition, 237

dependent, 271-272 
Probability, empirically determined,
240-241 

independent, 270-271 
independent vs. dependent illus- 
trated in real life case, 270-274 
law of large numbers, 239-240 
permutations, 232-233 
of possible combinations of 10 
coins (table), 282 
randomness, 241-242
and relative frequency of actual 
events, 239-242 
Probability calculus, 268-278 
addition theorem, 268-269
for dependent probabilities, 271- 
272 

examples of calculation, for dis- 
crete distributions, 274-276 
for continuous distribution,
276-278 

independence vs. dependence il- 
lustrated in real life case, 
273-274 

for independent probability, 270- 
271 

multiplication theorem, 269-274 
statement, 269 

Probability distributions, 252-267 
continuous, 253-254 
discrete, 253 

functional relationships in (see 
Functions) 

identical with certain types of 
frequency distribution, 254 
probability curve, 254 

(See also Normal frequency 
curve) 

Probability sets, calculation of “de-
rived” or “second-order” sets,
269ff.

finite, multiplication theorem valid
for, 272 

fundamental, 237-238 
infinite, 238-239 
multiplication theorem valid
for, 272

Problem of Estimation, The, 500


Product deviation, measurement of, 
339-347 

Product-moment coefficient of cor-
relation, 339ff.

Product-moment formula for r, use
in nonnormal frequency distri-
butions, 491-492
Product term, definition, 485 
disappears where correlation is 
absent, 485 

Public opinion, sampling of, 6 
Publications, statistical (see Statisti- 
cal publications) 

Q 

Quality control, 18, 248 
Quantum theory, 19-21 
Quartiles, calculation of, 218-220 
definition, 171-172 
interpretation, 227-228 
use in measuring skewness, 189- 
191 

Questionnaires, mailed, 48-49 

good-will letter used in support 
of (typical form), 50 
rules for constructing, 49, 51 
(See also Schedules) 

Quételet, A., 87, 499, 513, 549-550

R

Ratio charts, 131-137

advantages and disadvantages of,
135-137

paper used for, 133-135 
relative growth shown on, 131,
136-137

three scales of paper used for,
134-135 

value for comparisons impossible 
on arithmetic paper, 136-137 
Ratio scale (see Semilogarithmic 
paper) 

Rational trends, 547-558, 574-581 
dying institution, illustrated, 574-
577

possible trends in dying insti-
tution, 574
Rational trends, dying institution,
illustrated, trend fitted by
method of least squares
(graph), 577

work sheet for annual index
of normal and trend
(table), 576

growing institution, illustrated,
578-581 

curve fitted by method of 
selected points (graph), 581
method of selected points, 
578-581 

work sheet for index of nor-
mal and trend (table), 580
Reciprocal regression, 381-383 
Reciprocals of numbers (table), 691-
692

Regression, linear plane of, 397-399,
410-413

(See also linear plane of 
regression) 
logarithmic, 377-380 
multiple linear, 410-416
parabolic, 383-388 
reciprocal, 381-383
statistics, order in, 422 
Relative frequency, probability, 280 
Relativity theory, 21 
Research associations, 66-68 
Review of Economic Statistics, 66,
500, 630

S 

Sampling, 42-48

by Bureau of Labor Statistics, 
42-45 

fitting of normal curve to sample 
histogram, 299-300 
in government study of family
income and expenditures, 48 
of means, 315-319 
population mean, confidence 
limits for, 318 
estimate of, 318 
testing a hypothesis about,
317-318

sampling distribution, 315-316 
Sampling, testing a hypothesis, 317-
318

in 1940 U.S. Census, 42
of percentages, 307-315 
coefficient of risk, 310-311 
confidence coefficient, 311-312 
confidence interval, 313 
population percentage, deter- 
mining confidence limits 
for, 311-313 

likelihood of, defined, 315 
likelihood of, relation to 
probability of sample (dia- 
gram), 314 

maximum likelihood, estimate 
of, 313-315 

testing hypothesis about, 
309-311 

sampling distribution, 307-309
statistical inferences from, 309-
315

types of inference, 309 
typical problem, 307 
random, 241-242, 307 
relative frequency of samples 
follows binomial distribution 
pattern, 308 

sampling distributions, (see Sam- 
pling distributions) 
standard errors for selected statis- 
tics, where distribution ap- 
proximates normal curve 
(table), 320 

stratified, in construction of index 
numbers, 515-516 
use of normal frequency curve in, 
307-320 

conclusions as to, 319-320

used in business, 9 

of variances, population variance,
316
confidence limits for, 319 

optimum estimate of, 319
testing a hypothesis about,
318-319 

sampling distribution, 315-316
standard deviation, the, 317 

Sampling distributions, 307-317 
explained, 307-308 
of sample means, 315-316 
of sample percentages, 307-308
of sample variance, 316-317 

Schedules, coding of, 53-55 
editing of, 52-53 

mailed questionnaires (see Ques- 
tionnaires, mailed) 
problems of enumeration, 28-42 
questionnaires (see also Question-
naires)

tabulation of, 55 

(See also Tables; Tabulation)
units of description and measure- 
ment, 28-34 

illustrations of government care 
in, 35-48 

Seasonal variation, 617-636 
causes of, 618-621 
historical background of study, 
617-618 

in labor, McCabe, 620 
measurement illustrated, 625-633 
calculation by 12 months’ mov-
ing average method, 625,
626-633

completed index (table), 631
multiple frequency array, deter- 
minations from, 633 
illustrated (graphs), 631-632 
work sheet for calculating index 
(tables), 626-630 

method of detecting change in, 
633-636 

computation of index for single 
year (table), 636 
index required for each year 
because of observable trend, 

m 

trends in seasonal variation
illustrated (graphs), 634-635 
methods of measuring, 621-625
Kemmerer, 623 
Persons, 623 

problem of isolating, 621-625
link relative method, 623


Seasonal variation, problem of iso-
lating, ratio-difference-from-
trend method, 624
twelve months’ moving average
method, 624-625
various suggested methods, 
bibliography for, 624n 
testing whether well defined, 632
trend in, 634-635 

Second-order, indicates statistic with 
two figures to right of decimal 
in subscript, 422

Semilogarithmic paper, 133-135 
Series, bivariate, 149-154

frequency, 139-149 
Sheppard’s correction, 299 
Shewhart, W. A., 248 
Significant figures, meaning of, 230-
231

Simple correlation, 351-353 
lines of regression, calculated, 
362-363 

Simple functions, graphs of, 257-267 
circle, 259-260 
ellipse, 260 

exponential function, declining, 
262-263 
rising, 261-262 

first-degree polynomials, 257-
258

normal frequency curve, 263-267 
second-degree polynomials, 258-
259

Simpson, C. G., 14, 242 
Single best estimate, of population 
percentage in sampling, 314-315 
Skewness, definition and significance 
of, 185-193 

measurement of, by beta coeffi- 
cients, 192-193 

by medians and quartiles, 189- 
191 

by relation of mean, median, 
and mode, 185-189 
by third moment, 191-192 
Smith, Adam, 11-12 
Smithsonian Institution, 13

Social Science Research Council, 48
Social Security Administration, 637
Sources of statistical data, 56-91 
(See also Guides to sources of
statistics) 

agricultural, (see Agricultural sta-
tistics)

banking, (see Banking statistics,
sources of)
commercial, 68-71 
commercial and financial publi- 
cations, 70-71 
trade associations, 69-70 
trade journals, 68-69
federal, 71-84 

Congressional investigations, 82 
financial, 70 

commercial and financial publi- 
cations, 70-71 

general summary, developing pat- 
tern of sources, 59-62 
for social sciences, 57-58 
guides to (see Guides to sources) 
industrial, (see Industrial statis- 
tics, sources of) 
international, 91 

on labor, (see Labor statistics,
sources of) 

pattern of existing (outline), 61-62 
primary vs. secondary, 56 
private research, individuals, 61,
65-66

handbooks on, 63

research associations, 66-68 
(See also Statistical publica-
tions)

state and municipal, 84-85
on trade, (see Trade) 
on transportation and com-
munication, (see Transporta-
tion and communication
statistics)

world statistics, 85-91 
best sources of, 91 
Split-bar charts, 110-111 
Square roots of numbers, 100-1000
(table), 689-690
Squares, 100-990
of numbers (table), 685-686


Standard deviation, first-order, 338 
from lines of regression, 33^338 
zero-order, 338 

Standard error, of variance, 316 
of estimate, 338 
definition, 390 

Standard Industrial Classification
Code, 54

Statesman's Yearbook, 86 
Statistic, definition of a, 167 
Statistical Abstract of the United 
States, 64 

Statistical Atlas, 73
Statistical data (see Data) 
gathering of (see Data, gathering 
of) 

Statistical laws, 19 
Statistical publications, abstracting
agencies, 61n
world statistics, 85-91 

(See also Guides to sources) 
Statistics, accuracy in calculating, 
230-231 

in the arts and sciences, 1-23 
in astronomy, 12-13 
in biology, 14-16
in business administration, 7-9 
definition and meaning of, 1-4 
descriptive, 232 
in economic theory, 11-12 
in education, 9-11 
in engineering, 16-18 
forecasting by means of, 651-680 
gathering of, 24-55
historical development in, 24-28 
(See also Data, gathering of) 
in governmental administration,
6-7 

in medicine, 16 
and philosophy, 21-22
in physics and chemistry, 18-21
in politics, 6 

presentation of (see Presentation
of statistics)
in sociology, 11 

sources of (see Sources of statisti-
cal data)
Statistics, summarization and com-
parison by means of fre-
quency distributions, 165-
198

by means of index numbers, 
497-642 

measurements for, 166-182 
symbols used in (see Symbols used 
in statistics) 

theoretical, definition, 232 
by use of index numbers, 497ff.
in zoology, 13-14 

Summarization and comparison, in 
bivariate frequency distribu-
tions, 327-363
measurements of, 166-182 

Survey of Current Business, 77, 612, 
625, 531, 634 

Symbols used in statistics, 122-129 
basic symbols, 122-124 
multiple and partial correlation, 
401-404 

time series, 124-126 
passage of time, 124-125
units involved, 126-127 
where variable fluctuates with 
time, 125-126 

Symmetrical binomial distribution, 
character of, 283-285 
mean, 283-284 
moments, 286 
symmetry, 283 
variance, 284-286
derivation, 280-283
graph of, 284 

and the normal curve, 279-306 
beta values approach those of 
normal curve, 289 
distribution approaches normal
curve as limit, 286-290
graphic comparison, 288 
relative slope of frequency poly- 
gon and normal curve com- 
pared, 289 

real life conditions producing,
290-297

summary, 296-297


relative slope of frequency poly- 
gon computed for a given point 
(graph), 288 

for two values of N (graph), 286 
effect of scale adjustments
(graph), 287 

seen in relative frequencies,
282, 291-293 

Symmetry, in frequency surfaces,
486-487

in notation for multiple and par- 
tial correlation, 404 
writing of equations by, illus- 
trated, 431-432 

T 

Tables, 92-100

construction of, 92-93, 95-96 
general-purpose, 93-94 
special-purpose, 93, 95-97, 100 
types of, illustrated, 94-99 
Tabulation, machine, 66, 72-73 
mechanics of, 73 
principles of, 92 
(See also Tables) 

Test of goodness of fit, 300-306 
(See also Chi square (χ²) test of
goodness of fit) 

Theorem, Fourier’s, 561
Theory of errors, 294-296 
not intended to mask inaccuracy 
of calculation, 230 
Theory of relativity (see Relativity
theory)

Third-order statistics, 422
Time series, analysis of (see Cycle
determination; Seasonal varia- 
tion; Trend analysis) 
careful description of units in-
volved, 126

conventional charting of, 128-129 
cumulative vs. noncumulative 
data, 127-128

elements of variation in, 543-547
cycle, 544-547

long-term growth or trend, 543-
547
Time series, elements of variation
in, residual fluctuations, 574
seasonal variations, 543-547
hypothetical, showing elements of
variation (table), 544
rational basis of analysis of, 543-
563

rational trends, 547-558, 564
Time-series analysis, development of 
technique for, 560-563 
harmonic (periodogram) analy- 
sis, 561-562 

major cycle determination by
Kuznets’ methods, 561 
ordinary and minor cycle deter- 
mination by empirical 
methods, 561-562 
use of functions of arc tangent,
562 

use of orthogonal polynomials, 
562, 599-616 

empirical trends, 558-560 
application to cycle analysis, 
558-560 

empirical vs. rational trends, 564 
rational basis for, 543-563 
rational trends, application to 
social philosophy, 553-558 
basis for rationalizing, 550-552 
criticism of, 552-553
early population theories, 549-
550

historical background, 547-548 
population curves, 548-549
(See also Cycle determination;
Seasonal variation; Trend
analysis)

Trade, Department of Commerce, 86
Commerce Yearbook, 86 
domestic, 81

Federal Trade Commission, 81 
foreign, 77, 82, 86 

Bureau of Foreign and Domes-
tic Commerce, 77
Statistical Abstract of the
United States, 77
Survey of Current Business, 77
Statesman's Yearbook, 86


Trade, U.S. Tariff Commission, 82 
Transportation and communication 
statistics, 81 

Interstate Commerce Commis- 
sion, 81 

Treasury Department, 7 
Treatise on Money, J. M. Keynes, 
617 

Trend analysis, 564-616 
detecting cycle by removing em- 
pirical trend, 565 
empirical vs. rational trend, 564 
empirical trends, illustrated, 582- 
598 

analysis of cycles by empirical 
trends, 594-598 

finite differences method for 
finding trend values, 589-594
polynomials, 583-594 
straight-line trend, 582-583 
methods of fitting trend, 565-574 
by averages, 573 
moving averages method, 
573-574 

by least squares, 565-573 
advantages of method, 574
basic method, 565-568 
numerical illustration, 568- 
569 

probability theory not ap- 
plied, 570-571 

second- or third-degree
curves, 569-570
by selected points, 571-573 
orthogonal-polynomial trends (see
Orthogonal-polynomial
trends)

rational trends, illustrated, 574-
581
dying institution, 574-577
growing institution, 578-581
Trends, empirical (see Empirical
trends)

orthogonal-polynomial (see Orth-
ogonal-polynomial trends)